Preface


This book was developed using the leanpub 1 platform. Please send 
any feedback or corrections to: bgweber@gmail.com

The data science landscape is constantly evolving, because new
tools and libraries are enabling smaller teams to deliver more
impactful products. In the current state, data scientists are
expected to build systems that scale not only to a single product,
but to a portfolio of products. The goal of this book is to provide
data scientists with a set of tools that can be used to build
predictive model services for product teams.

This text is meant to be a Data Science 201 course for data science
practitioners who want to develop skills for the applied science
discipline. The target audience is readers with past experience with
Python and scikit-learn who want to learn how to build data
products. The goal is to get readers hands-on with a number of tools
and cloud environments that they would use in industry settings.


0.1 Prerequisites 

This book assumes that readers have prior knowledge of Python
and Pandas, as well as some experience with modeling packages
such as scikit-learn. This is a book that will focus on breadth
rather than depth, where the goal is to get readers hands-on with
a number of different tools.

Python has a large library of books available, covering the
language fundamentals, specific packages, and disciplines such as data
science. Here are some of the books I would recommend for readers
to build additional knowledge of the Python ecosystem.

• Python and Pandas

— Data Science from Scratch (Grus, 2015): Introduces
Python from a data science perspective.

— Python for Data Analysis (McKinney, 2017): Provides
extensive details on the Pandas library.

• Machine Learning

— Hands-On Machine Learning (Géron, 2017): Covers
scikit-learn in depth as well as TensorFlow and Keras.

— Deep Learning with Python (Chollet, 2017): Provides an
excellent introduction to deep learning concepts using Keras
as the core framework.

I will walk through the code samples in this book in detail, but
will not cover the fundamentals of Python. Readers may find it
useful to first explore these texts before digging into building
large-scale pipelines in Python.


0.2 Book Contents 

The general theme of the book is to take simple machine learning
models and to scale them up in different configurations across
multiple cloud environments. Here are the topics covered in this
book:

1. Introduction: This chapter will motivate the use of
Python and discuss the discipline of applied data science,
present the data sets, models, and cloud environments
used throughout the book, and provide an overview of
automated feature engineering.

2. Models as Web Endpoints: This chapter shows how
to use web endpoints for consuming data and hosting
machine learning models as endpoints using the Flask and
Gunicorn libraries. We’ll start with scikit-learn models
and also set up a deep learning endpoint with Keras.
3. Models as Serverless Functions: This chapter will
build upon the previous chapter and show how to set 
up model endpoints as serverless functions using AWS 
Lambda and GCP Cloud Functions. 

4. Containers for Reproducible Models: This chapter 
will show how to use containers for deploying models with 
Docker. We’ll also explore scaling up with ECS and
Kubernetes, and building web applications with Plotly Dash.

5. Workflow Tools for Model Pipelines: This chapter 
focuses on scheduling automated workflows using Apache 
Airflow. We’ll set up a workflow that pulls data from
BigQuery, applies a model, and saves the results.

6. PySpark for Batch Modeling: This chapter will
introduce readers to PySpark using the community edition of
Databricks. We’ll build a batch model pipeline that pulls
data from a data lake, generates features, applies a model,
and stores the results to a NoSQL database.

7. Cloud Dataflow for Batch Modeling: This chapter 
will introduce the core components of Cloud Dataflow and 
implement a batch model pipeline for reading data from 
BigQuery, applying an ML model, and saving the results 
to Cloud Datastore. 

8. Streaming Model Workflows: This chapter will
introduce readers to Kafka and PubSub for streaming
messages in a cloud environment. After working through this
material, readers will learn how to use these message
brokers to create streaming model pipelines with PySpark
and Dataflow that provide near real-time predictions.

After working through this material, readers should have hands-on
experience with many of the tools needed to build data products, 
and have a better understanding of how to build scalable machine 
learning pipelines in a cloud environment. 
0.3 Code Examples

Since the focus of this book is to get readers hands-on with Python
code, I have provided code listings for a subset of the chapters on
GitHub. The following URL provides listings for code examples 
that work well in a Jupyter environment: 

• https://github.com/bgweber/DS_Production 

Due to formatting restrictions, many of the code snippets in this
book break commands into multiple lines while omitting the
continuation operator (\). To get code blocks to work in Jupyter or
another Python coding environment, you may need to remove these
line breaks. The code samples in the notebooks listed above do
not add these line breaks and can be executed without
modification, excluding credential and IP changes. This book uses the
terms scikit-learn and sklearn interchangeably, with sklearn used
explicitly in Section 3.3.3.
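
As a small illustration of the wrapping issue described above, the
sketch below shows one made-up expression written in the three forms
that Python accepts. The variable names are hypothetical and are not
taken from the book's listings.

```python
# A printed listing might wrap an expression like this:
#
#   total = first_value + second_value
#       + third_value
#
# Pasted as-is, that would not run as intended, because the
# second line is no longer attached to the first.

first_value, second_value, third_value = 1, 2, 3

# Option 1: restore the continuation operator at the end of the line.
total_backslash = first_value + second_value \
    + third_value

# Option 2: wrap the expression in parentheses, which lets it span lines.
total_parens = (first_value + second_value
                + third_value)

# Option 3: remove the line break entirely.
total_single = first_value + second_value + third_value

print(total_backslash, total_parens, total_single)
```

Any of the three forms is fine in a notebook cell; removing the line
break is usually the quickest fix when copying from the text.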


0.4 Acknowledgements 

I was able to author this book using Yihui Xie’s excellent
bookdown package (Xie, 2015). For the design, I used Shashi Kumar’s
template 2 available under the Creative Commons 4.0 license. The
book cover uses Cedric Franchetti’s image from pxhere 3.

This book was last updated on December 31, 2019. 


^https://bit.ly/2Mj FDgV 
Introduction


Putting predictive models into production is one of the most
direct ways that data scientists can add value to an organization. By
learning how to build and deploy scalable model pipelines, data
scientists can own more of the model production process and rapidly
deliver data products. Building data products is more than just
putting code into production; it also includes DevOps and
lifecycle management of live systems.

Throughout this book, we’ll cover different cloud environments
and tools for building scalable data and model pipelines. The goal
is to provide readers with the opportunity to get hands-on and
start building experience with a number of different tools. While
this book is targeted at analytics practitioners with prior Python
experience, we’ll walk through examples from start to finish, but
won’t dig into the details of the programming language itself.

The role of data science is constantly transforming and adding
new specializations. Data scientists that build production-grade
services are often called applied scientists. Their goal is to build
systems that are scalable and robust. In order to be scalable, we
need to use tools that can parallelize and distribute code.
Parallelizing code means that we can perform multiple tasks
simultaneously, and distributing code means that we can scale up the
number of machines needed in order to accomplish a task. Robust
services are systems that are resilient and can recover from failure.
While the focus of this book is on scalability rather than
robustness, we will cover monitoring systems in production and discuss
measuring model performance over time.
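
To make the idea of parallelizing concrete, here is a minimal
single-machine sketch using Python's standard concurrent.futures
module. The score_batch function and the batches are hypothetical
stand-ins for applying a model to chunks of data. Threads are used
only to keep the example short; CPU-heavy work would more likely use
processes or a cluster framework such as PySpark.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch):
    """Hypothetical stand-in for applying a model to one batch of records."""
    return [value * 2 for value in batch]


batches = [[1, 2], [3, 4], [5, 6]]

# Sequential version: each batch is processed one after another.
sequential = [score_batch(batch) for batch in batches]

# Parallel version: the batches are handed to a pool of workers.
# Executor.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(score_batch, batches))

print(parallel == sequential)
```

Distributing the same work would mean sending the batches to workers
on different machines, which is the role that Spark and Dataflow play
in later chapters.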

During my career as a data scientist, I’ve worked at a number of
video game companies and have had experience putting propensity
models, lifetime-value predictions, and recommendation systems
into production. Overall, this process has become more
streamlined with the development of tools such as PySpark, which enable
data scientists to more rapidly build end-to-end products. While
many companies now have engineering teams with machine
learning focuses, it’s valuable for data scientists to have broad
expertise in productizing models. Owning more of the process means
that a data science team can deliver products quicker and iterate
much more rapidly.

Data products are useful for organizations, because they can
provide personalization for the user base. For example, the
recommendation system that I designed for EverQuest Landmark 1 provided
curated content for players from a marketplace with thousands
of user-created items. The goal of any data product should be
creating value for an organization. The recommendation system
accomplished this goal by increasing the revenue generated from
user-created content. Propensity models, which predict the
likelihood of a user to perform an action, can also have a direct impact
on core metrics for an organization, by enabling personalized
experiences that increase user engagement.

The process used to productize models is usually unique for each
organization, because of different cloud environments, databases,
and product organizations. However, many of the same tools are
used within these workflows, such as SQL and PySpark. Your
organization may not be using the same data ecosystem as these
examples, but the methods should transfer to your use cases.

In this chapter, we will introduce the role of applied science and
motivate the usage of Python for building data products, discuss
different cloud and coding environments for scaling up data science,
introduce the data sets and types of models used throughout the
book, and introduce automated feature engineering as a step to
include in data science workflows.


^https://bit.ly/2YF!YPg 
1.1 Applied Data Science

Data science is a broad discipline with many different
specializations. One distinction that is becoming common is between
product data science and applied data science. Product data scientists
are typically embedded on a product team, such as a game studio, and
provide analysis and modeling that helps the team directly
improve the product. For example, a product data scientist might
find an issue with the first-time user experience in a game, and
make recommendations such as which languages to focus on for
localization to improve new user retention.

Applied science is at the intersection of machine learning
engineering and data science. Applied data scientists focus on building
data products that product teams can integrate. For example, an
applied scientist at a game publisher might build a recommendation
service that different game teams can integrate into their products.
Typically, this role is part of a central team that is responsible for
owning a data product. A data product is a production system
that provides predictive models, such as identifying which items a
player is likely to buy.

Applied scientist is a job title that is growing in usage across tech
companies including Amazon, Facebook, and Microsoft. The need
for this type of role is growing, because a single applied scientist
can provide tremendous value to an organization. For example,
instead of having product data scientists build bespoke propensity
models for individual games, an applied scientist can build a
scalable approach that provides a similar service across a portfolio of
games. At Zynga, one of the data products that the applied data
science team built was a system called AutoModel 2, which
provided several propensity models for all of our games, such as the
likelihood for a specific player to churn.

There have been a few developments in technology that have made
applied science a reality. Tools for automated feature engineering,

^https://ubm.io/2KdYRDq 
such as deep learning, and scalable computing environments, such
as PySpark, have enabled companies to build large-scale data
products with smaller team sizes. Instead of hiring engineers for data
ingestion and warehousing, data scientists for predictive modeling,
and additional engineers for building a machine learning
infrastructure, you can now use managed services in the cloud to enable
applied scientists to take on more of the responsibilities previously
designated to engineering teams.

One of the goals of this book is to help data scientists make the
transition to applied science, by providing hands-on experience
with different tools that can be used for scalable compute and
standing up services for predictive models. We will work through
different tools and cloud environments to build proofs of concept
for data products that can translate to production environments.


1.2 Python for Scalable Compute 

Python is quickly becoming the de facto language for data science.
In addition to the huge library of packages that provide useful
functionality, one of the reasons that the language is becoming
so popular is that it can be used for building scalable data and
predictive model pipelines. You can use Python on your local
machine and build predictive models with scikit-learn, or use
environments such as Dataflow and PySpark to build distributed systems.
While these different environments use different libraries and
programming paradigms, it’s all in the same language of Python. It’s
no longer necessary to translate an R script into a production
language such as Java; you can use the same language for both
development and production of predictive models.

It took me a while to adopt Python as my data science language
of choice. Java had been my preferred language, regardless of task,
since early in my undergraduate career. For data science tasks, I
used tools like Weka to train predictive models. I still find Java to
be useful when building data pipelines, and it’s great to know in
order to directly collaborate with engineering teams on projects. I
later switched to R while working at Electronic Arts, and found
the transition to an interactive coding environment to be quite
useful for data science. One of the features I really enjoyed in R is
R Markdown, which you can use to write documents with inline
code. In fact, this entire book was written using an extension of R
Markdown called bookdown (Xie, 2019). I later switched to using
R within Jupyter notebooks and even wrote a book on using R
and Java for data science at a startup (Weber, 2018).

When I started working at Zynga in 2018, I adopted Python and 
haven’t looked back. It took a bit of time to get used to the new 
language, but there are a number of reasons that I wanted to learn 
Python: 

• Momentum: Many teams are already using Python for
production, or portions of their data pipelines. It makes sense to also
use Python for performing analysis tasks.

• PySpark: R and Java don’t provide a good transition to
authoring Spark tasks interactively. You can use Java for Spark, but
it’s not a good fit for exploratory work, and the transition from
Python to PySpark seems to be the most approachable way to
learn Spark.

• Deep Learning: I’m interested in deep learning, and while there
are R bindings for libraries such as Keras, it’s better to code in
the native language of these libraries. I previously used R to
author custom loss functions, and debugging errors was
problematic.

• Libraries: In addition to the deep learning libraries offered for
Python, there are a number of other useful tools including Flask
and Bokeh. There are also notebook environments that can scale,
including Google’s Colaboratory and AWS SageMaker.

To ease the transition from R to Python, I took the following steps: 

• Focus on outcomes, not semantics: Instead of learning
about all of the fundamentals of the language, I first focused
on doing in Python what I already knew how to do in other
languages, such as training a logistic regression model.
• Learn the ecosystem, not the language: I didn’t limit myself
to the base language when learning, and instead jumped right in 
to using Pandas and scikit-learn. 

• Use cross-language libraries: I was already familiar with
Keras and Plotly in R, and used knowledge of these libraries
to bootstrap learning Python.

• Work with real-world data: I used the data sets provided by 
Google’s BigQuery to test out my scripts on large-scale data. 

• Start locally if possible: While one of my goals was to learn
PySpark, I first focused on getting things up and running on my
local machine before moving to cloud ecosystems.

There are many situations where Python is not the best choice for a
specific task, but it does have broad applicability when prototyping
models and building scalable model pipelines. Because of Python’s
rich ecosystem, we will be using it for all of the examples in this
book.


1.3 Cloud Environments 

In order to build scalable data science pipelines, it’s necessary to
move beyond single machine scripts and move to clusters of
machines. While this is possible to do with an on-premise setup, a
common trend is using cloud computing environments in order to
achieve large-scale processing. There are a number of different
options available, with the top three platforms currently being
Amazon Web Services (AWS), Google Cloud Platform (GCP), and
Microsoft Azure.

Most cloud platforms offer free credits for getting started. GCP 
offers a $300 credit for new users to get hands on with their tools, 
while AWS provides free-tier access for many services. In this book,
we’ll get hands on with both AWS and GCP, with little to no cost 
involved. 
1.3.1 Amazon Web Services (AWS)

AWS is currently the largest cloud provider, and this dominance 
has been demonstrated by the number of gaming companies using 
the platform. I’ve had experience working with AWS at Electronic 
Arts, Twitch, and Zynga. This platform has a wide range of tools 
available, but getting these components to work well together
generally takes additional engineering work versus GCP.

With AWS, you can use both self-hosted and managed solutions
for building data pipelines. For example, the managed option for
messaging on AWS is Kinesis, while the self-hosted option is Kafka.
We’ll walk through examples with both of these options in Chapter
8. There’s typically a tradeoff between cost and DevOps when
choosing between self-hosted and managed options.

The default database to use on AWS is Redshift, which is a
columnar database. This option works well as a data warehouse, but it
doesn’t scale well to data lake volumes. It’s common for
organizations to set up a separate data lake and data warehouse on AWS.
For example, it’s possible to store data on S3 and use tools such
as Athena to provide data lake functionality, while using Redshift
as a solution for a data warehouse. This approach has worked well
in the past, but it creates issues when building large-scale data
science pipelines. Moving data in and out of a relational database
can be a bottleneck for these types of workflows. One of the
solutions to this bottleneck is to use vendor solutions that separate
storage from compute, such as Snowflake or Delta Lake.

The first component we’ll work with in AWS is Elastic Compute
Cloud (EC2) instances. These are individual virtual machines that
you can spin up and provision for any necessary task. In section
1.4.1, we’ll show how to set up an instance that provides a
remote Jupyter environment. EC2 instances are great for getting
started with tools such as Flask and Gunicorn, and getting started
with Docker. To scale up beyond individual instances, we’ll explore
Lambda functions and Elastic Container Service (ECS).

To build scalable pipelines on AWS, we’ll focus on PySpark as the
main environment. PySpark enables Python code to be distributed
across a cluster of machines, and vendors such as Databricks
provide managed environments for Spark. Another option that is
available only on AWS is SageMaker, which provides a Jupyter
notebook environment for training and deploying models. We are not
covering SageMaker in this book, because it is specific to AWS
and currently supports only a subset of predictive models. Instead,
we’ll explore tools such as MLflow.

1.3.2 Google Cloud Platform (GCP) 

GCP is currently the third largest cloud platform provider, and
offers a wide range of managed tools. It’s currently being used by
large media companies such as Spotify, and within the games
industry by King and Niantic. One of the main benefits of
using GCP is that many of the components can be wired together
using Dataflow, which is a tool for building batch and streaming
data pipelines. We’ll create a batch model pipeline with Dataflow
in Chapter 7 and a streaming pipeline in Chapter 8.

Google Cloud Platform currently offers a smaller set of tools than 
AWS, but there is feature parity for many common tools, such as 
PubSub in place of Kinesis, and Cloud Functions in place of AWS 
Lambda. One area where GCP provides an advantage is BigQuery 
as a database solution. BigQuery separates storage from compute, 
and can scale to both data lake and data warehouse use cases.

Dataflow is one of the most powerful tools for data scientists that
GCP provides, because it empowers a single data scientist to build
large-scale pipelines with much less effort than other platforms.
It enables building streaming pipelines that connect PubSub for
messaging, BigQuery for analytics data stores, and BigTable for
application databases. It’s also a managed solution that can
autoscale to meet demand. While the original version of Dataflow
was specific to GCP, it’s now based on the Apache Beam library
which is portable to other platforms.
1.4 Coding Environments

There’s a variety of options for writing Python code in order to do
data science. The best environment to use likely varies based on
what you are building, but notebook environments are becoming
more and more common as the place to write Python scripts. The
three types of coding environments I’ve worked with for Python
are IDEs, text editors, and notebooks.

If you’re used to working with an IDE, tools like PyCharm and
Rodeo are useful editors and provide additional tools for debugging
versus other options. It’s also possible to write code in text editors
such as Sublime and then run scripts via the command line. I
find this works well for building web applications with Flask and
Dash, where you need to have a long running script that persists
beyond the scope of running a cell in a notebook. I now perform
the majority of my data science work in notebook environments,
and this covers exploratory analysis and productizing models.

I like to work in coding environments that make it trivial to share
code and collaborate on projects. Databricks and Google Colab
are two coding environments that provide truly collaborative
notebooks, where multiple data scientists can simultaneously work on
a script. When using Jupyter notebooks, this level of real-time
collaboration is not currently supported, but it’s good practice to
share notebooks through version control systems such as GitHub.

In this book, we’ll use only the text editor and notebook
environments for coding. For learning how to build scalable pipelines, I
recommend working on a remote machine, such as EC2, to become 
more familiar with cloud environments, and to build experience 
setting up Python environments outside of your local machine. 

1.4.1 Jupyter on EC2 

To get experience with setting up a remote machine, we’ll start by
setting up a Jupyter notebook environment on an EC2 instance in
AWS. The result is a remote machine that we can use for Python
scripting. Accomplishing this task requires spinning up an EC2
instance, configuring firewall settings for the EC2 instance,
connecting to the instance using SSH, and running a few commands
to deploy a Jupyter environment on the machine.

FIGURE 1.1: Public and Private IPs on EC2

The first step is to set up an AWS account and log into the AWS
management console. AWS provides a free account option with
free-tier access to a number of services including EC2. Next,
provision a machine using the following steps:

1. Under “Find Services”, search for EC2 

2. Click “Launch Instance” 

3. Select a free-tier Amazon Linux AMI 

4. Click “Review and Launch”, and then “Launch” 

5. Create a key pair and save to your local machine 

6. Click “Launch Instances” 

7. Click “View Instances” 

The machine may take a few minutes to provision. Once the
machine is ready, the instance state will be set to “running”. We can
now connect to the machine via SSH. One note on the different
AMI options is that some of the configurations are set up with
Python already installed. However, this book focuses on Python 3
and the included version is often 2.7.

There are two different IPs that you need in order to connect to the
machine via SSH and later connect to the machine via web browser.
The public and private IPs are listed under the “Description” tab
as shown in Figure 1.1. To connect to the machine via SSH we’ll
use the Public IP (54.87.230.152). For connecting to the machine,
you’ll need to use an SSH client such as PuTTY if working in a
Windows environment. For Linux and Mac OS, you can use ssh
via the command line. To connect to the machine, use the user
name “ec2-user” and the key pair generated when launching the
instance. An example of connecting to EC2 using the Bitvise client
on Windows is shown in Figure 1.2.

FIGURE 1.2: SSH connection settings.

Once you connect to the machine, you can check the Python
version by running python --version. On my machine, the result was
2.7.16, meaning that additional setup is needed in order to
upgrade to Python 3. We can run the following commands to install
Python 3, pip, and Jupyter.


sudo yum install -y python37
python3 --version

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python3 get-pip.py
pip --version

pip install --user jupyter


The two version commands are to confirm that the machine is 
pointing at Python 3 for both Python and pip. Once Jupyter is 
installed, we’ll need to set up a firewall restriction so that we can
connect directly to the machine on port 8888, where Jupyter runs 
by default. This approach is the quickest way to get connected to 
the machine, but it’s advised to use SSH tunneling to connect to 
the machine rather than a direct connection over the open web. 
You can open up port 8888 for your local machine by performing 
the following steps from the EC2 console: 

1. Select your EC2 instance 

2. Under “Description”, select security groups 

3. Click “Actions” -> “Edit Inbound Rules” 

4. Add a new rule: change the port to 8888, select “My IP” 

5. Click “Save” 

We can now run and connect to Jupyter on the EC2 machine. To
launch Jupyter, run the command shown below while replacing the
IP with your EC2 instance’s Private IP. It is necessary to specify
the --ip parameter in order to enable remote connections to Jupyter,
as incoming traffic will be routed via the private IP.

jupyter notebook --ip 172.31.53.82 

When you run the jupyter notebook command, you’ll get a URL
with a token that can be used to connect to the machine.
Before entering the URL into your browser, you’ll need to swap the
Private IP output to the console with the Public IP of the EC2
instance, as shown in the snippet below.

# Original URL
The Jupyter Notebook is running at:
http://172.31.53.82:8888/?token=98175f620fd68660d26fa7970509c6c49ec2afc280956a26

# Swap Private IP with Public IP
http://54.87.230.152:8888/?token=98175f620fd68660d26fa7970509c6c49ec2afc280956a26

FIGURE 1.3: Jupyter Notebook on EC2.
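
The host swap can also be scripted rather than edited by hand. The
following is a minimal sketch using Python's standard urllib.parse
module; the swap_host helper is a hypothetical name introduced here
for illustration, and the IPs and token are the example values from
this section.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_host(url, public_ip):
    """Return the URL with its host replaced, keeping port, path, and token."""
    parts = urlsplit(url)
    # Preserve the port (8888 for Jupyter by default) if one is present.
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(parts._replace(netloc=f"{public_ip}{port}"))


private_url = (
    "http://172.31.53.82:8888/?token="
    "98175f620fd68660d26fa7970509c6c49ec2afc280956a26"
)
public_url = swap_host(private_url, "54.87.230.152")
print(public_url)
```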

You can now paste the updated URL into your browser to connect
to Jupyter on the EC2 machine. The result should be a fresh Jupyter
notebook install with a single file, get-pip.py, in the base
directory, as shown in Figure 1.3. Now that we have a machine set
up with Python 3 and Jupyter notebook, we can start exploring
different data sets for data science.


1.5 Datasets 

To build scalable data pipelines, we’ll need to switch from using
local files, such as CSVs, to distributed data sources, such as
Parquet files on S3. While the tools used across cloud platforms to
load data vary significantly, the end result is usually the same,
which is a dataframe. In a single machine environment, we can use
Pandas to load the dataframe, while distributed environments use
different implementations such as Spark dataframes in PySpark.
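
For the single machine case, a minimal sketch of loading a dataframe
with Pandas is shown below. The CSV content and the column names
(user_id, purchases) are made up for illustration and are not one of
the book's data sets; an in-memory string stands in for a local file
so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a local file on disk.
csv_text = """user_id,purchases
1,3
2,0
3,5
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Display the first records, as done with head() elsewhere in the chapter.
print(df.head())
```

With a real file, the io.StringIO wrapper would be replaced by the
path to the CSV.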

This section will introduce the data sets that we’ll explore
throughout the rest of the book. In this chapter we’ll focus on loading
the data using a single machine, while later chapters will present
distributed approaches. While most of the data sets presented here
can be downloaded as CSV files and read into Pandas using read_csv,
it’s good practice to develop automated workflows to connect to
diverse data sources. We’ll explore the following datasets throughout
this book:

• Boston Housing: Records of sale prices of homes in the Boston
housing market back in 1980.

• Game Purchases: A synthetic data set representing games
purchased by different users on Xbox One.

• Natality: One of BigQuery’s open data sets on birth statistics
in the US over multiple decades.

• Kaggle NHL: Play-by-play events from professional hockey
games and game statistics over the past decade.

The first two data sets are single commands to load, as long as
you have the required libraries installed. The Natality and Kaggle
NHL data sets require setting up authentication files before you
can programmatically pull the data sources into Pandas.

The first approach we’ll use to load a data set is to retrieve it
directly from a library. Multiple libraries include the Boston
housing data set, because it is a small data set that is useful for
testing out regression models. We’ll load it from scikit-learn by
first running pip from the command line:

pip install --user pandas
pip install --user scikit-learn


Once scikit-learn is installed, we can switch back to the Jupyter
notebook to explore the data set. The code snippet below shows
how to load the scikit-learn and Pandas libraries, load the Boston
data set as a Pandas dataframe, and display the first 5 records.
The result of running these commands is shown in Figure 1.4.


from sklearn.datasets import load_boston
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this
# snippet requires an older release of the library.
data, target = load_boston(return_X_y=True)
bostonDF = pd.DataFrame(data, columns=load_boston().feature_names)
bostonDF['label'] = target
bostonDF.head()


   CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  label
0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
1  0.02731  0.0   7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
2  0.02729  0.0   7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
3  0.03237  0.0   2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
4  0.06905  0.0   2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2

FIGURE 1.4: Boston Housing data set.


The second approach we’ll use to load a data set is to fetch it
from the web. The CSV for the Games data set is available as a
single file on GitHub. We can fetch it into a Pandas dataframe by
using the read_csv function and passing the URL of the file as a
parameter. The result of reading the data set and printing out the
first few records is shown in Figure 1.5.


gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
gamesDF.head()


Both of these approaches are similar to downloading CSV files and
reading them from a local directory, but by using these methods
we can avoid the manual step of downloading files.

FIGURE 1.5: Game Purchases data set (columns G1 through G10 and label).


1.5.1 BigQuery to Pandas

One of the ways to automate workflows authored in Python is to
directly connect to data sources. For databases, you can use
connectors based on JDBC or native connectors, such as the bigquery
module provided by the Google Cloud library. This connector
enables Python applications to send queries to BigQuery and load
the results as a Pandas dataframe. This process involves setting up
a GCP project, installing the prerequisite Python libraries, setting
up the Google Cloud command line tools, creating GCP credentials,
and finally sending queries to BigQuery programmatically.

If you do not already have a GCP account set up, you’ll need to
create a new account [3]. Google provides a $300 credit for getting
up and running with the platform. The first step is to install the
Google Cloud library by running the following commands:


pip install --user google-cloud-bigquery
pip install --user matplotlib


Next, we’ll need to set up the Google Cloud command line tools,
in order to set up credentials for connecting to BigQuery. While
the files to use will vary based on the current release [4], here
are the steps I ran on the command line:

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-255.0.0-linux-x86_64.tar.gz
tar zxvf google-cloud-sdk-255.0.0-linux-x86_64.tar.gz google-cloud-sdk
./google-cloud-sdk/install.sh


[3] https://cloud.google.com/gcp
[4] https://cloud.google.com/sdk/install

Once the Google Cloud command line tools are installed, we can
set up credentials for connecting to BigQuery:


gcloud config set project project_name
gcloud auth login
gcloud init

gcloud iam service-accounts create dsdemo
gcloud projects add-iam-policy-binding your_project_id \
    --member "serviceAccount:dsdemo@your_project_id.iam.gserviceaccount.com" \
    --role "roles/owner"
gcloud iam service-accounts keys create dsdemo.json \
    --iam-account dsdemo@your_project_id.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json


You’ll need to substitute project_name with your project name,
your_project_id with your project ID, and dsdemo with your desired
service account name. The result is a JSON file with credentials
for the service account. The export command at the end of this
process tells Google Cloud where to find the credentials file.
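Before sending queries, it can help to confirm that the environment variable really points at a usable key file. The sketch below is one possible check using only the standard library; the function name load_service_account and the placeholder key contents are assumptions for illustration, and a real key file contains additional fields such as the private key.

```python
import json
import os
import tempfile


def load_service_account(environ=os.environ,
                         env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the parsed key file named by env_var, or raise a descriptive error."""
    path = environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    with open(path) as handle:
        key = json.load(handle)
    # Service account key files carry a "type" field with this value.
    if key.get("type") != "service_account":
        raise ValueError("credentials file is not a service account key")
    return key


# Dry run with a placeholder key file written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "dsdemo.json")
with open(demo_path, "w") as handle:
    json.dump({"type": "service_account",
               "client_email": "dsdemo@your_project_id.iam.gserviceaccount.com"},
              handle)
key = load_service_account({"GOOGLE_APPLICATION_CREDENTIALS": demo_path})
print(key["client_email"])
```

On the EC2 instance you would call load_service_account() with no arguments, so that it reads the variable set by the export command.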

Setting up credentials for Google Cloud is involved, but generally
only needs to be performed once. Now that credentials are
configured, it’s possible to directly query BigQuery from a Python
script. The snippet below shows how to load a BigQuery client, send
a SQL query to retrieve 10 rows from the natality data set, and pull
the results into a Pandas dataframe. The resulting dataframe is
shown in Figure 1.6.


from google.cloud import bigquery
client = bigquery.Client()

sql = """
  SELECT *
  FROM `bigquery-public-data.samples.natality`
  limit 10
"""

natalityDF = client.query(sql).to_dataframe()
natalityDF.head()


FIGURE 1.6: Previewing the BigQuery data set (5 rows x 31 columns,
including source_year, year, day, and wday).


1.5.2 Kaggle to Pandas 

Kaggle is a data science website that provides thousands of open
data sets to explore. While it is not possible to pull Kaggle data
sets directly into Pandas dataframes, we can use the Kaggle library
to programmatically download CSV files as part of an automated
workflow.

The quickest way to get set up with this approach is to create an
account on Kaggle [5]. Next, go to the account tab of your profile
and select ‘Create API Token’, download and open the file, and
then run vi .kaggle/kaggle.json on your EC2 instance to copy over
the contents to your remote machine. The result is a credential file
you can use to programmatically download data sets. We’ll explore
the NHL (Hockey) data set by running the following commands:

pip install kaggle --user

kaggle datasets download martinellis/nhl-game-data
unzip nhl-game-data.zip
chmod 0600 *.csv


[5] https://www.kaggle.com


FIGURE 1.7: NHL Kaggle data set.


These commands will download the data set, unzip the files into
the current directory, and enable read access on the files. Now
that the files are downloaded on the EC2 instance, we can load
and display the Game data set, as shown in Figure 1.7. This data
set includes different files, where the game file provides game-level
summaries and the game_plays file provides play-by-play details.
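If you prefer to script the credential step instead of editing the file with vi, the token can be written from Python. This is a sketch under the assumption that the token file holds a username and a key field, which is the layout of the file Kaggle generates; the helper name write_kaggle_token and the placeholder values are hypothetical.

```python
import json
import os
import stat
import tempfile


def write_kaggle_token(home_dir, username, key):
    """Write <home_dir>/.kaggle/kaggle.json with owner-only read/write permissions."""
    kaggle_dir = os.path.join(home_dir, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    token_path = os.path.join(kaggle_dir, "kaggle.json")
    with open(token_path, "w") as handle:
        json.dump({"username": username, "key": key}, handle)
    # Same effect as chmod 600: other users should not be able to read the token.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Dry run in a temporary directory with placeholder credentials.
demo_home = tempfile.mkdtemp()
token_path = write_kaggle_token(demo_home, "your_username", "your_api_key")
with open(token_path) as handle:
    token = json.load(handle)
print(token["username"])
```

On the EC2 instance you would pass your real home directory and the values from the downloaded token file.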


import pandas as pd

nhlDF = pd.read_csv('game.csv')
nhlDF.head()

We walked through a few different methods for loading data sets
into a Pandas dataframe. The common theme with these different
approaches is that we want to avoid manual steps in our workflows,
in order to automate pipelines.


1.6 Prototype Models 

Machine learning is one of the most important steps in the pipeline
of a data product. We can use predictive models to identify which
users are most likely to purchase an item, or which users are most
likely to stop using a product. The goal of this section is to present
simple versions of predictive models that we’ll later scale up in
more complex pipelines. This book will not focus on state-of-the-art
models, but instead cover tools that can be applied to a variety
of different machine learning algorithms.

The library to use for implementing different models will vary
based on the cloud platform and execution environment being used
to deploy a model. The regression models presented in this section
are built with scikit-learn, while the models we’ll build out with
PySpark use MLlib.

1.6.1 Linear Regression 

Regression is a common task for supervised learning, such as
predicting the value of a home, and linear regression is a useful
algorithm for making predictions for these types of problems.
scikit-learn provides both linear and logistic regression models for
making predictions. We’ll start by using the LinearRegression class
in scikit-learn to predict home prices for the Boston housing data
set. The code snippet below shows how to split the Boston data set
into different training and testing data sets and separate data
(x_train) and label (y_train) objects, create and fit a linear
regression model, and calculate error metrics on the test data set.


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# See Section 1.5 (Boston Housing data set)
bostonDF = ...

x_train, x_test, y_train, y_test = train_test_split(
    bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.3)

model = LinearRegression()
model.fit(x_train, y_train)

print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Error: " + str(sum(
    abs(y_test - model.predict(x_test))) / y_test.count()))
The train_test_split function is used to split up the data set into
70% train and 30% holdout data sets. The first parameter is the
data attributes from the Boston dataframe, with the label dropped,
and the second parameter is the labels from the dataframe. The
two commands at the end of the script calculate the R-squared
value (the coefficient of determination returned by the score
function) and the mean error, which is defined as the mean absolute
difference between predicted and actual home prices. The output of
this script was an R² value of 0.699 and mean error of 3.36. Since
house prices in this data set are divided by a thousand,
the mean error is $3.36k.
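To make the two metrics concrete, here is a small sketch that computes them by hand in plain Python. The actual values are the first five labels from the Boston data set, while the predicted values are made up purely for illustration; the function names are hypothetical and are not part of scikit-learn.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [24.0, 21.6, 34.7, 33.4, 36.2]      # home prices in $1000s
predicted = [26.0, 22.6, 31.7, 33.4, 34.2]   # hypothetical model output

r2 = r_squared(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print("R^2: " + str(round(r2, 3)))
print("Mean Error: " + str(round(mae, 2)))
```

An R-squared of 1.0 would mean the predictions match the labels exactly, while a value near 0 means the model does no better than always predicting the mean price.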

We now have a simple model that we can productize in a number of
different environments. In later sections and chapters, we’ll explore
methods for scaling features, supporting more complex regression
models, and automating feature generation.

1.6.2 Logistic Regression 

Logistic regression is a supervised classification algorithm that is
useful for predicting which users are likely to perform an action,
such as purchasing a product. Using scikit-learn, the process is
similar to fitting a linear regression model. The main differences
from the prior script are the data set being used, the model object
instantiated (LogisticRegression), and using the predict_proba
function to calculate error metrics. This function predicts a
probability in the continuous range of [0,1] rather than a specific
label. The snippet below predicts which users are likely to purchase
a specific game based on prior games already purchased:


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Games data set
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")

x_train, x_test, y_train, y_test = train_test_split(
    gamesDF.drop(['label'], axis=1), gamesDF['label'], test_size=0.3)

model = LogisticRegression()
model.fit(x_train, y_train)

print("Accuracy: " + str(model.score(x_test, y_test)))
print("ROC: " + str(roc_auc_score(y_test,
    model.predict_proba(x_test)[:, 1])))

The output of this script is two metrics that describe the
performance of the model on the holdout data set. The accuracy metric
describes the number of correct predictions over the total number
of predictions, and the ROC metric (the area under the ROC curve)
summarizes how well the model separates the two classes across
different model thresholds. ROC is a useful metric to use when the
different classes being predicted are imbalanced, with noticeably
different sizes. Since most players are unlikely to buy a specific
game, ROC is a good metric to utilize for this use case. When I ran
this script, the result was
an accuracy of 86.6% and an ROC score of 0.757.
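The difference between the two metrics is easier to see on a tiny imbalanced example. The sketch below computes accuracy at a 0.5 threshold and ROC AUC as the fraction of positive/negative pairs that the model ranks correctly, with ties counted as half. The labels and scores are invented for illustration and the function names are hypothetical; in practice you would use roc_auc_score as in the script above.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of records whose thresholded score matches the label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Eight non-purchasers and two purchasers (hypothetical propensity scores).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.3, 0.45, 0.2]

print("Accuracy: " + str(accuracy(labels, scores)))
print("ROC: " + str(roc_auc(labels, scores)))
```

Here no score reaches the threshold, so the model predicts "no purchase" for everyone and still gets most records right, while the ROC value reflects how well the scores order purchasers ahead of non-purchasers.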

Linear and logistic regression models with scikit-learn are a good
starting point for many machine learning projects. We’ll explore
more complex models in this book, but one of the general strategies
I take as a data scientist is to quickly deliver a proof of concept,
and then iterate and improve a model once it is shown to provide
value to an organization.

1.6.3 Keras Regression 

While I generally recommend starting with simple approaches
when building model pipelines, deep learning is becoming a popular
tool for data scientists to apply to new problems. It’s great
to explore this capability when tackling new problems, but scaling
up deep learning in data science pipelines presents a new set of
challenges. For example, PySpark does not currently have a native
way of distributing the model application phase to big data.
There are plenty of books for getting started with deep learning in
Python, such as (Chollet, 2017).

In this section, we’ll repeat the same task from the prior section,
which is predicting which users are likely to buy a game based on
their prior purchases. Instead of using a shallow learning approach
to predict propensity scores, we’ll use the Keras framework to build
a neural network for predicting this outcome. Keras is a general
framework for working with deep learning implementations. We
can install these dependencies from the command line:


pip install --user tensorflow==1.14.0
pip install --user keras==2.2.4


This process can take a while to complete, and based on your
environment may run into installation issues. It’s recommended to
verify that the installation worked by checking your Keras version
in a Jupyter notebook:

import tensorflow as tf
import keras
from keras import models, layers
import matplotlib.pyplot as plt
keras.__version__


The general process for building models with Keras is to set up
the structure of the model, compile the model, fit the model, and
evaluate the model. We’ll start with a simple model to provide a
baseline, which is shown in the snippet below. This code creates a
network with an input layer, a dropout layer, a hidden layer, and
an output layer. The input to the model is 10 binary variables that
describe prior games purchased, and the output is a prediction of
the likelihood to purchase a specified game.


x_train, x_test, y_train, y_test = train_test_split(
    gamesDF.drop(['label'], axis=1), gamesDF['label'], test_size=0.3)

# define the network structure
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# define ROC AUC as a metric
def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(
        tf.local_variables_initializer())
    return auc

# compile and fit the model
model.compile(optimizer='rmsprop',
    loss='binary_crossentropy', metrics=[auc])
history = model.fit(x_train, y_train, epochs=100, batch_size=100,
    validation_split=.2, verbose=0)

Since the goal is to identify the likelihood of a player to purchase a 
game, ROC is a good metric to use to evaluate the performance of 
a model. Keras does not support this directly, but we can define a 
custom metrics function that wraps the auc functionality provided 
by TensorFlow. 

Next, we specify how to optimize the model. We’ll use rmsprop for
the optimizer and binary_crossentropy for the loss function. The last
step is to train the model. The code snippet shows how to fit the
model using the training data set, 100 training epochs with a batch
size of 100, and a validation split of 20%. This process can
take a while to run if you increase the number of epochs or decrease
the batch size. The validation set is sampled from only the training
data set.
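To get a feel for what these fit parameters imply, the sketch below works through the arithmetic for a hypothetical training set of 1,000 rows. It mirrors the documented Keras behavior of holding out the last fraction of the training samples for validation before shuffling, but it is a plain Python illustration with made-up helper names rather than Keras code.

```python
import math


def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    num_validation = int(len(samples) * validation_split)
    split_at = len(samples) - num_validation
    return samples[:split_at], samples[split_at:]


def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to pass over the data once."""
    return math.ceil(num_samples / batch_size)


train_rows = list(range(1000))   # hypothetical training set of 1,000 rows
fit_rows, val_rows = split_for_validation(train_rows, 0.2)

updates = steps_per_epoch(len(fit_rows), batch_size=100)
print("rows used for fitting: " + str(len(fit_rows)))
print("rows used for validation: " + str(len(val_rows)))
print("weight updates per epoch: " + str(updates))
print("weight updates over 100 epochs: " + str(updates * 100))
```

Halving the batch size doubles the number of updates per epoch, which is why smaller batches or more epochs lengthen the training run.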

The result of this process is a history object that tracks the loss
and metrics on the training and validation data sets. The code
snippet below shows how to plot these values using Matplotlib.
The output of this step is shown in Figure 1.8. The plot shows that
both the training and validation data sets leveled off at around a
0.82 AUC metric during the model training process. To compare
this approach with the logistic regression results, we’ll evaluate the
performance of the model on the holdout data set.

FIGURE 1.8: ROC AUC metrics for the training process.


loss = history.history['auc']
val_loss = history.history['val_auc']
epochs = range(1, len(loss) + 1)

plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training AUC')
plt.plot(epochs, val_loss, 'b', label='Validation AUC')
plt.legend()
plt.show()


To measure the performance of the model on the test data set, we
can use the evaluate function to measure the ROC metric. The code
snippet below shows how to perform this task on our test data
set, which results in an ROC AUC value of 0.816. This is noticeably
better than the performance of the logistic regression model, with
an AUC value of 0.757, but using other shallow learning methods
such as random forests or XGBoost would likely perform much
better on this task.


results = model.evaluate(x_test, y_test, verbose=0)
print("ROC: " + str(results[1]))


1.7 Automated Feature Engineering 

Automated feature engineering is a powerful tool for reducing the
amount of manual work needed in order to build predictive models.
Instead of a data scientist spending days or weeks coming up
with the best features to describe a data set, we can use tools
that approximate this process. One library I’ve been working with
to implement this step is FeatureTools. It takes inspiration from
the automated feature engineering process in deep learning, but
is meant for shallow learning problems where you already have
structured data, but need to translate multiple tables into a single
record per user. The library can be installed as follows:


sudo yum install gcc
sudo yum install python3-devel
pip install --user framequery
pip install --user fsspec
pip install --user featuretools


In addition to this library, I loaded the framequery library, which 
enables writing SQL queries against dataframes. Using SQL to 
work with dataframes versus specific interfaces, such as Pandas, is 
useful when translating between different execution environments. 

The task we’ll apply the FeatureTools library to is predicting which
games in the Kaggle NHL data set are postseason games. We’ll
make this prediction based on summarizations of the play events
that are recorded for each game. Since there can be hundreds of
play events per game, we need a process for aggregating these into
a single summary per game. Once we aggregate these events into
a single game record, we can apply a logistic regression model to
predict whether the game is regular or postseason.

The first step we’ll perform is loading the data sets and performing
some data preparation, as shown below. After loading the data sets
as Pandas dataframes, we drop a few attributes from the plays
object, and fill any missing attributes with 0.


import pandas as pd

game_df = pd.read_csv("game.csv")
plays_df = pd.read_csv("game_plays.csv")

plays_df = plays_df.drop(['secondaryType', 'periodType',
    'dateTime', 'rink_side'], axis=1).fillna(0)


To translate the play events into a game summary, we’ll first
1-hot encode two of the attributes in the plays dataframe, and then
perform deep feature synthesis. The code snippet below shows how
to perform the first step, and uses FeatureTools to accomplish this
task. The result is quite similar to using the get_dummies function
in Pandas, but this approach requires some additional steps.

The base representation in FeatureTools is an EntitySet, which
describes a set of tables and the relationships between them,
which is similar to defining foreign key constraints. To use the
encode_features function, we need to first translate the plays
dataframe into an entity. We can create an EntitySet directly from
the plays_df object, but we also need to specify which attributes
should be handled as categorical, using the variable_types
dictionary parameter.


import featuretools as ft
from featuretools import Feature

# Note: entity_from_dataframe and variable_types belong to the
# FeatureTools API prior to version 1.0.
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays", dataframe=plays_df,
    index="play_id", variable_types={
        "event": ft.variable_types.Categorical,
        "description": ft.variable_types.Categorical})

f1 = Feature(es["plays"]["event"])
f2 = Feature(es["plays"]["description"])

encoded, defs = ft.encode_features(plays_df, [f1, f2], top_n=10)
encoded.reset_index(inplace=True)
encoded.head()


FIGURE 1.9: The 1-hot encoded Plays dataframe (5 rows x 37 columns,
with indicator columns such as event = Faceoff, event = Shot, and
event = Hit).


Next, we pass a list of features to the encode_features function,
which returns a new dataframe with the dummy variables and a
defs object that describes how to translate an input dataframe
into the 1-hot encoded format. For pipelines later on in this book,
where we need to apply transformations to new data sets, we’ll
store a copy of the defs object for later use. The result of applying
this transformation to the plays dataframe is shown in Figure 1.9.
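For intuition about what this kind of encoding does, the sketch below implements a simplified top-n 1-hot encoder in plain Python. It is not the FeatureTools implementation: the function name, the sample rows, and the exact column names are assumptions for illustration. The list of kept values plays a role similar to the defs object, since it records which categories received their own column.

```python
from collections import Counter


def one_hot_top_n(rows, column, top_n):
    """Encode the top_n most frequent values of column as 0/1 indicator columns."""
    counts = Counter(row[column] for row in rows)
    kept = [value for value, _ in counts.most_common(top_n)]
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != column}
        for value in kept:
            out[f"{column} = {value}"] = int(row[column] == value)
        # Values outside the top_n share a single catch-all indicator.
        out[f"{column} is unknown"] = int(row[column] not in kept)
        encoded.append(out)
    return encoded, kept


# Hypothetical play events.
plays = [
    {"play_id": 1, "event": "Faceoff"},
    {"play_id": 2, "event": "Shot"},
    {"play_id": 3, "event": "Faceoff"},
    {"play_id": 4, "event": "Hit"},
    {"play_id": 5, "event": "Shot"},
    {"play_id": 6, "event": "Faceoff"},
]
encoded_rows, kept_values = one_hot_top_n(plays, "event", top_n=2)
print(kept_values)
print(encoded_rows[3])
```

Keeping the list of encoded categories around is what makes it possible to apply the same transformation to new data, where the most frequent values might otherwise differ.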

The next step is to aggregate the hundreds of play events per game
into single game summaries, where the resulting dataframe has
a single row per game. To accomplish this task, we’ll recreate the
EntitySet from the prior step, but use the 1-hot encoded dataframe
as the input. Next, we use the normalize_entity function to describe
games as a parent object to plays events, where all plays with the
same game_id are grouped together. The last step is to use the dfs
function to perform deep feature synthesis. DFS applies aggregate
calculations, such as sum and max, across the different features in
the child dataframe, in order to collapse hundreds of records into
a single row.


es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",
    dataframe=encoded, index="play_id")
es = es.normalize_entity(base_entity_id="plays",
    new_entity_id="games", index="game_id")

features, transform = ft.dfs(entityset=es,
    target_entity="games", max_depth=2)
features.reset_index(inplace=True)
features.head()


FIGURE 1.10: Generated features for the NHL data set (5 rows x 212
columns, with aggregates such as SUM(plays.event = Faceoff) and
SUM(plays.event = Shot) for each game_id).


The result of this process is shown in Figure 1.10. The shape of
the sampled dataframe, 5 rows by 212 columns, indicates that we
have generated hundreds of features to describe each game using
deep feature synthesis. Instead of hand coding this translation, we
utilized the FeatureTools library to automate this process.
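The core idea behind this aggregation step can be sketched in a few lines of plain Python. The example below groups hypothetical encoded play records by game_id and applies sum and max to each feature. It is an illustration of the concept rather than the FeatureTools algorithm, which supports many more primitives and deeper relationships, and the function name is made up.

```python
from collections import defaultdict


def aggregate_by_game(encoded_plays, key="game_id"):
    """Collapse many play records into one summary row per game."""
    grouped = defaultdict(list)
    for play in encoded_plays:
        grouped[play[key]].append(play)

    summaries = []
    for game_id, plays in grouped.items():
        summary = {key: game_id, "COUNT(plays)": len(plays)}
        feature_names = [name for name in plays[0] if name != key]
        for name in feature_names:
            values = [play[name] for play in plays]
            summary[f"SUM(plays.{name})"] = sum(values)
            summary[f"MAX(plays.{name})"] = max(values)
        summaries.append(summary)
    return summaries


# Hypothetical 1-hot encoded play events for two games.
encoded_plays = [
    {"game_id": 1, "event = Shot": 1, "event = Hit": 0},
    {"game_id": 1, "event = Shot": 0, "event = Hit": 1},
    {"game_id": 1, "event = Shot": 1, "event = Hit": 0},
    {"game_id": 2, "event = Shot": 0, "event = Hit": 1},
]
summaries = aggregate_by_game(encoded_plays)
for row in summaries:
    print(row)
```

With two aggregates per input column, the number of generated features grows quickly, which is how a few dozen encoded columns turn into hundreds of game-level features.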

Now that we have hundreds of features for describing a game, we
can use logistic regression to make predictions about the games.
For this task, we want to predict whether a game is regular season
or postseason, where type = 'P'. The code snippet below shows how
to use the framequery library to combine the generated features
with the initially loaded games dataframe using a SQL join. We
use the type attribute to assign a label, and then return all of the
generated features and the label. The result is a dataframe that
we can pass to scikit-learn.
import framequery as fq

# assign labels to the generated features
features = fq.execute("""
  SELECT f.*
    ,case when g.type = 'P' then 1 else 0 end as label
  FROM features f
  JOIN game_df g
    on f.game_id = g.game_id
""")
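If you want to experiment with the labeling query without installing framequery, the same SQL can be tried against an in-memory SQLite database from the standard library. The tables below are tiny stand-ins with made-up values and a single feature column, so this is only a sketch of the join logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-ins for the generated features and the games table.
conn.execute("CREATE TABLE features (game_id INTEGER, shots INTEGER)")
conn.execute("CREATE TABLE game_df (game_id INTEGER, type TEXT)")
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(1, 61), (2, 58), (3, 70)])
conn.executemany("INSERT INTO game_df VALUES (?, ?)",
                 [(1, "R"), (2, "P"), (3, "R")])

# Same join and CASE expression as the framequery snippet above.
labeled = conn.execute("""
    SELECT f.*
      ,CASE WHEN g.type = 'P' THEN 1 ELSE 0 END AS label
    FROM features f
    JOIN game_df g
      ON f.game_id = g.game_id
    ORDER BY f.game_id
""").fetchall()
conn.close()

for row in labeled:
    print(row)
```

Each output row carries the original feature columns followed by the label, with postseason games marked as 1.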

We can re-use the logistic regression code from above to build a
model that predicts whether an NHL game is a regular or postseason
game. The updated code snippet to build a logistic regression
model with scikit-learn is shown below. We drop the game_id column
before fitting the model to avoid training the model on this
attribute, which typically results in overfitting.


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# create inputs for sklearn
y = features['label']
X = features.drop(['label', 'game_id'], axis=1).fillna(0)

# train a classifier
lr = LogisticRegression()
model = lr.fit(X, y)

# Results
print("Accuracy: " + str(model.score(X, y)))
print("ROC: " + str(roc_auc_score(y, model.predict_proba(X)[:,1])))

The result of this model was an accuracy of 94.7% and an ROC
measure of 0.923. Note that these metrics are computed on the same
records used to fit the model, so they are likely optimistic
compared to a holdout evaluation. While we likely could have created
a better performing model by manually specifying how to aggregate
play events into a game summary, we were able to build a model with
good accuracy while automating much of this process.


1.8 Conclusion 

Building data products is becoming an essential competency for
applied data scientists. The Python ecosystem provides useful tools
for taking prototype models and scaling them up to
production-quality systems. In this chapter, we laid the groundwork
for the rest of this book by introducing the data sets, coding tools,
cloud environments, and predictive models that we’ll use to build
scalable model pipelines. We also explored a recent Python library
called FeatureTools, which enables automating much of the feature
engineering steps in a model pipeline.

In our current setup, we built a simple batch model on a single
machine in the cloud. In the next chapter, we’ll explore how to
share our models with the world, by exposing them as endpoints
on the web.

In order for a machine learning model to be useful, you need a
way of sharing the results with other services and applications
within your organization. While you can precompute results and
save them to a database using a batch pipeline approach, it’s often
necessary to respond to requests in real-time with up-to-date
information. One way of achieving this goal is by setting up a
predictive model as a web endpoint that can be invoked from other
services. This chapter shows how to set up this functionality for
both scikit-learn and Keras models, and introduces Python tools
that can help scale up this functionality.

It’s good to build experience both hosting and consuming web
endpoints when building out model pipelines with Python. In some
cases, a predictive model will need to pull data points from other
services before making a prediction, such as needing to pull
additional attributes about a user’s history as input to feature
engineering. In this chapter, we’ll focus on JSON based services,
because it is a popular data format and works well with Python’s
data types.

A model as an endpoint is a system that provides a prediction in
response to a passed in set of parameters. These parameters can
be a feature vector, image, or other type of data that is used as
input to a predictive model. The endpoint then makes a prediction
and returns the results, typically as a JSON payload. The benefits
of setting up a model this way are that other systems can use the
predictive model, it provides a real-time result, and can be used
within a broader data pipeline.
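Stripped of any web framework, this request and response cycle is just a function from a JSON string to a JSON string. The sketch below shows that shape with a stand-in model; the function name, the response fields, and the toy scoring rule are all hypothetical and only meant to illustrate the flow that a hosted endpoint wraps.

```python
import json


def predict_endpoint(request_body, model):
    """Parse a JSON request, score it with the model, and return a JSON response."""
    params = json.loads(request_body)
    # Build the feature vector in the order the model expects.
    features = [params[name] for name in model["feature_names"]]
    score = model["predict"](features)
    return json.dumps({"success": True, "prediction": score})


# Hypothetical stand-in model: the share of listed games already purchased.
toy_model = {
    "feature_names": ["G1", "G2", "G3"],
    "predict": lambda features: sum(features) / len(features),
}

response = predict_endpoint('{"G1": 1, "G2": 0, "G3": 1}', toy_model)
print(response)
```

Because the caller only sees JSON going in and JSON coming out, the model behind the endpoint can be retrained or replaced without changing the services that consume it.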

In this chapter, we’ll cover calling web services using Python,
setting up endpoints, saving models so that they can be used in
production environments, hosting scikit-learn and Keras predictive
models, scaling up a service with Gunicorn and Heroku, and building
an interactive web application with Plotly Dash.


2.1 Web Services 

Before we host a predictive model, we’ll use Python to call a web
service and to process the result. After showing how to process a
web response, we’ll set up our own service that echoes the passed
in message back to the caller. There are a few different libraries
we’ll need to install for the examples in this chapter:


pip install --user requests
pip install --user flask
pip install --user gunicorn
pip install --user mlflow
pip install --user pillow
pip install --user dash


These libraries provide the following functionality: 

• requests: Provides functions for GET and POST commands. 

• flask: Enables functions to be exposed as HTTP locations. 

• gunicorn: A WSGI server that enables hosting Flask apps in 
production environments. 

• mlflow: A model library that provides model persistence. 

• pillow: A fork of the Python Imaging Library. 

• dash: Enables writing interactive web apps in Python. 

Many of the tools for building web services in the Python ecosystem
work well with Flask. For example, Gunicorn can be used to
host Flask applications at production scale, and the Dash library
builds on top of Flask.

To get started with making web requests in Python, we’ll use the
Cat Facts Heroku app (https://cat-fact.herokuapp.com/#/). Heroku is a
cloud platform that works well for hosting Python applications, and
we’ll explore it later in this chapter. The Cat Facts service provides a
simple API that returns a JSON response containing interesting tidbits
about felines. We can use the /facts/random endpoint to retrieve a
random fact using the requests library:


import requests

result = requests.get("http://cat-fact.herokuapp.com/facts/random")
print(result)
print(result.json())
print(result.json()['text'])


This snippet loads the requests library and then uses the get
function to perform an HTTP GET for the passed-in URL. The result
is a response object that provides a response code and payload if
available. In this case, the payload can be processed using the json
function, which returns the payload as a Python dictionary. The
three print statements show the response code, the full payload,
and the value for the text key in the returned dictionary object.
The output for a run of this script is shown below.

<Response [200]> 

{'used': False, 'source': 'api', 'type': 'cat', 'deleted': False,
 '_id': '591f98c5d1f17a153828aa0b', '__v': 0, 'text':
 'Domestic cats purr both when inhaling and when exhaling.',
 'updatedAt': '2019-05-19T20:22:45.768Z',
 'createdAt': '2018-01-04T01:10:54.673Z'}
Domestic cats purr both when inhaling and when exhaling.
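
When calling a third-party service, the request can fail or the
payload can be missing the expected key. The sketch below pulls that
check into a small helper. The name extract_text is hypothetical, and
the two arguments are assumed to come from result.status_code and
result.text in the snippet above. It uses only the standard library
json module, so it can be tried without network access.

```python
import json

def extract_text(status_code, body, default=None):
    """Return the 'text' field of a JSON payload, or default on any problem."""
    # Treat any non-2xx status as a failed call.
    if not 200 <= status_code < 300:
        return default
    try:
        payload = json.loads(body)
    except ValueError:
        # The body was not valid JSON.
        return default
    if not isinstance(payload, dict):
        return default
    return payload.get("text", default)

# Example with a hard-coded body shaped like the Cat Facts response.
body = '{"type": "cat", "text": "Domestic cats purr both when inhaling and when exhaling."}'
print(extract_text(200, body))
print(extract_text(503, body, default="service unavailable"))
```

Returning a default rather than raising is a design choice for this
sketch; a pipeline that must not silently skip records may prefer to
raise an exception instead.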


2.1.1 Echo Service 

Before we set up a complicated environment for hosting a predictive
model, we’ll start with a simple example. The first service
we’ll set up is an echo application that returns the passed-in message
parameter as part of the response payload. To implement this
functionality, we’ll use Flask to build a web service hosted on an
EC2 instance. This service can be called on the open web, using
the public IP of the EC2 instance. You can also run this service
on your local machine, but it won’t be accessible over the web. In
order to access the function, you’ll need to enable access on port
5000, which is covered in Section 1.4.1. The complete code for the
echo web service is shown below:


import flask

app = flask.Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def predict():
    data = {"success": False}

    # check for passed in parameters
    params = flask.request.json
    if params is None:
        params = flask.request.args

    # if parameters are found, echo the msg parameter
    if "msg" in params.keys():
        data["response"] = params.get("msg")
        data["success"] = True

    return flask.jsonify(data)

if __name__ == '__main__':
    app.run(host='0.0.0.0')


The first step is loading the Flask library and creating a Flask
object using the __name__ special variable. Next, we define a predict
function with a Flask decorator that specifies that the function
should be hosted at “/” and accessible by HTTP GET and POST
commands. The last step specifies that the application should run
using 0.0.0.0 as the host, which enables remote machines to access
the application. By default, the application will run on port 5000,
but it’s possible to override this setting with the port parameter.
When running Flask directly, we need to call the run function, but
we do not want to call it when the script is loaded as a module
by another application, such as Gunicorn.

The predict function returns a JSON response based on the passed-in
parameters. In Python, you can think of a JSON response as a
dictionary, because the jsonify function in Flask makes the
translation between these data formats seamless. The function first
defines a dictionary with the success key set to False. Next, the
function checks if the request.json or request.args values are set,
which indicates that the caller passed in arguments to the function;
we’ll cover how to do this in the next code snippet. If the user has
passed in a msg parameter, the success key is set to True and a
response key is set to the msg parameter in the dictionary. The result
is then returned as a JSON payload.
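
The branching described above can be exercised without starting a
server. The sketch below is a plain-Python restatement of the request
handling, not part of the Flask application itself. The helper name
build_echo_response is hypothetical, and its two arguments stand in
for the values of flask.request.json and flask.request.args.

```python
def build_echo_response(json_params, query_params):
    """Mirror the echo logic: prefer the JSON body, fall back to query args."""
    data = {"success": False}

    # Use the JSON body when one was sent, otherwise the query string.
    params = json_params if json_params is not None else query_params

    # Echo the msg parameter only when the caller provided one.
    if params is not None and "msg" in params:
        data["response"] = params.get("msg")
        data["success"] = True
    return data

print(build_echo_response(None, {"msg": "HelloWorld!"}))
print(build_echo_response({"msg": "Hello from data"}, {}))
print(build_echo_response(None, {}))
```

Keeping this logic in a function that takes plain dictionaries makes
it straightforward to unit test separately from the web framework.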

Instead of running this in a Jupyter notebook, we’ll save the script
as a file called echo.py. To launch the Flask application, run python3
echo.py on the command line. The result of running this command
is shown below:


python3 echo.py
 * Serving Flask app "echo" (lazy loading)
 * Environment: production
   WARNING: This is a development server.
   Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

The output indicates that the service is running on port 5000. If
you launched the service on your local machine, you can browse
to http://localhost:5000 to call the application, and if using EC2
you’ll need to use the public IP, such as http://52.90.199.190:5000.
The result will be {"response":null,"success":false}, which indicates
that the service was called but that no message was provided
to the echo service.

We can pass parameters to the web service using a few different
approaches. The parameters can be appended to the URL, specified
using the params object when using a GET command, or passed
in using the json parameter when using a POST command. The
snippet below shows how to perform these types of requests. For
small sets of parameters, the GET approach works fine, but for
larger parameters, such as sending images to a server, the POST
approach is preferred.


import requests

result = requests.get("http://52.90.199.190:5000/?msg=HelloWorld!")
print(result.json())

result = requests.get("http://52.90.199.190:5000/",
                      params={'msg': 'Hello from params'})
print(result.json())

result = requests.post("http://52.90.199.190:5000/",
                       json={'msg': 'Hello from data'})
print(result.json())


The output of the code snippet is shown below. There are 3 JSON
responses showing that the service successfully received the message
parameter and echoed the response:


{'response': 'HelloWorld!', 'success': True}
{'response': 'Hello from params', 'success': True}
{'response': 'Hello from data', 'success': True}
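
When parameters are appended to the URL by hand, characters such as
spaces and ampersands need to be encoded; the requests library does
this automatically when the params argument is used. The sketch
below shows the encoding step on its own using the standard library.
The helper name build_url and the localhost address are assumptions
made for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)

url = build_url("http://localhost:5000/", {"msg": "Hello from params"})
print(url)

# Decoding the query string recovers the original message.
print(parse_qs(urlsplit(url).query)["msg"][0])
```

Because the query string is part of the URL, this approach is best
kept to short values, which is one reason the POST approach is
preferred for larger payloads.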


FIGURE 2.1: Passing an image to the echo web service.

In addition to passing values to a service, it can be useful to pass
larger payloads, such as images when hosting deep learning models.
One way of achieving this task is by encoding images as strings,
which will work with our existing echo service. The code snippet
below shows how to read in an image and perform base64 encoding
on the image before adding it to the request object. The echo
service responds with the image payload and we can use the PIL
library to render the image as a plot.

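import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import io
import base64
import requests

# read the image and encode it as a base64 string; the bytes are
# decoded to str because JSON cannot carry raw bytes
image = open("luna.png", "rb").read()
encoded = base64.b64encode(image).decode("utf-8")

# POST is preferred for larger payloads such as images
result = requests.post("http://52.90.199.190:5000/",
                       json={'msg': encoded})

# decode the echoed payload and render it as a plot
encoded = result.json()['response']
imgData = base64.b64decode(encoded)
plt.imshow(np.array(Image.open(io.BytesIO(imgData))))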

We can run the script within a Jupyter notebook. The script will
load the image and send it to the server, and then render the result
as a plot. The output of this script, which uses an image of my
in-laws’ cat, is shown in Figure 2.1. We won’t work much with
image data in this book, but I did want to cover how to use more
complex objects with web endpoints.
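
The encoding step can be checked in isolation, without an image file
or a running service. The sketch below is a minimal round trip that
assumes any sequence of bytes can stand in for the image contents;
the helper names encode_for_json and decode_from_json are
hypothetical.

```python
import base64
import json

def encode_for_json(raw_bytes):
    """Base64-encode bytes and return a str that json.dumps accepts."""
    return base64.b64encode(raw_bytes).decode("ascii")

def decode_from_json(text):
    """Reverse encode_for_json and return the original bytes."""
    return base64.b64decode(text)

# Stand-in for the contents of an image file (assumption: any bytes work).
raw = bytes(range(256))

# Serialize to a JSON message and recover the bytes on the other side.
message = json.dumps({"msg": encode_for_json(raw)})
restored = decode_from_json(json.loads(message)["msg"])
print(restored == raw)
```

Base64 represents every 3 bytes with 4 characters, so the encoded
payload is roughly a third larger than the original file.
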


2.2 Model Persistence

To host a model as a web service, we need to provide a model
object for the predict function. We can train the model within
the web service application, or we can use a pre-trained model.
Model persistence is a term used for saving and loading models
within a predictive model pipeline. It’s common to train models
in a workflow that is separate from the pipeline used to serve the
model, such as a Flask application. In this section, we’ll save and
load both scikit-learn and Keras models, with both direct serialization
and the MLflow library. The goal of saving and loading these
models is to make the logistic regression and deep learning models
we built in Chapter 1 available as web endpoints.

2.2.1 Scikit-Learn 

We’ll start with scikit-learn, which we previously used to build a 
propensity model for identifying which players were most likely to 
purchase a game. A simple LogisticRegression model object can be 
created using the following script: 


import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("https://github.com/bgweber/Twitch/"
                 "raw/master/Recommendations/games-expand.csv")
x = df.drop(['label'], axis=1)
y = df['label']

model = LogisticRegression()
model.fit(x, y)

The default way of saving scikit-learn models is by using pickle,
which provides serialization of Python objects. You can save a
model using dump and load a model using the load function, as
shown below. Once you have loaded a model, you can use the
prediction functions, such as predict_proba.


import pickle

pickle.dump(model, open("logit.pkl", 'wb'))

model = pickle.load(open("logit.pkl", 'rb'))
model.predict_proba(x)


Pickle is great for simple workflows, but can run into serialization
issues when your execution environment is different from your
production environment. For example, you might train models on your
local machine using Python 3.7 but need to host the models on an
EC2 instance running Python 3.6 with different library versions
installed.
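
One lightweight way to make such a mismatch visible is to store the
Python version next to the pickled object and compare it on load.
The sketch below illustrates the idea with the standard library only.
The helper names and the dictionary standing in for a trained model
are assumptions, and the check only detects a version difference; it
does not make the pickle portable.

```python
import io
import pickle
import sys

def dump_with_metadata(obj, fileobj):
    """Pickle obj together with the Python version used to create it."""
    record = {"python": tuple(sys.version_info[:2]), "model": obj}
    pickle.dump(record, fileobj)

def load_with_check(fileobj):
    """Load a record and report whether the Python version matches."""
    record = pickle.load(fileobj)
    same_version = record["python"] == tuple(sys.version_info[:2])
    return record["model"], same_version

# A dictionary of coefficients stands in for a trained model here.
buffer = io.BytesIO()
dump_with_metadata({"coef": [0.1, -0.2], "intercept": 0.5}, buffer)
buffer.seek(0)
model, same_version = load_with_check(buffer)
print(model, same_version)
```

Library versions, such as the scikit-learn version used for training,
could be recorded in the same dictionary.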

MLflow is a broad project focused on improving the lifecycle of
machine learning projects. The Models component of this platform
focuses on making models deployable across a diverse range of
execution environments. A key goal is to make models more portable,
so that your training environment does not need to match your
deployment environment. In the current version of MLflow, many
of the save and load functions wrap direct serialization calls, but
future versions will be focused on using generalized model formats.

We can use MLflow to save a model using sklearn.save_model and
load a model using sklearn.load_model. The script below shows how
to perform the same task as the prior code example, but uses
MLflow in place of pickle. The model is saved at the model_path
location, which is a relative path. There’s also a commented-out
command, which needs to be uncommented if the code is executed
multiple times. MLflow currently throws an exception if a model
is already saved at the current location, and the rmtree command
can be used to remove the existing model so that it can be replaced.


import mlflow
import mlflow.sklearn
import shutil

model_path = "models/logit_games_v1"
#shutil.rmtree(model_path)
mlflow.sklearn.save_model(model, model_path)

loaded = mlflow.sklearn.load_model(model_path)
loaded.predict_proba(x)
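
Calling rmtree on a path that does not exist raises an error, which is
why the line above has to be toggled by hand between the first run
and later runs. A small guard removes the need for that. The sketch
below is one possible approach; the helper name reset_model_dir is
hypothetical, and a temporary directory is used so that no real model
is touched.

```python
import os
import shutil
import tempfile

def reset_model_dir(model_path):
    """Delete model_path if it exists so a fresh model can be saved there."""
    if os.path.isdir(model_path):
        shutil.rmtree(model_path)
        return True
    return False

# Demonstrate on a temporary directory rather than a real model path.
base = tempfile.mkdtemp()
model_path = os.path.join(base, "models", "logit_games_v1")

first_call = reset_model_dir(model_path)    # nothing to delete yet
os.makedirs(model_path)
second_call = reset_model_dir(model_path)   # existing directory removed
print(first_call, second_call)

shutil.rmtree(base)
```

With a helper like this, the reset can be called unconditionally
before mlflow.sklearn.save_model.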


2.2.2 Keras 

Keras provides built-in functionality for saving and loading deep 
learning models. We covered building a Keras model for the games 
data set in Section 1.6.3. The key steps in this process are shown 
in the following snippet: 

import tensorflow as tf
import keras
from keras import models, layers

# define the network structure
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(
        tf.local_variables_initializer())
    return auc

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy', metrics=[auc])

history = model.fit(x, y, epochs=100, batch_size=100,
                    validation_split=.2, verbose=0)

Once we have trained a Keras model, we can use the save and
load_model functions to persist and reload the model using the h5
file format. One additional step here is that we need to pass the
custom auc function we defined as a metric to the load function in
order to reload the model. Once the model is loaded, we can call
the prediction functions, such as evaluate.

from keras.models import load_model

model.save("games.h5")

model = load_model('games.h5', custom_objects={'auc': auc})
model.evaluate(x, y, verbose=0)

We can also use MLflow for Keras. The save_model and load_model
functions can be used to persist Keras models. As before, we need
to provide the custom-defined auc function to load the model.


import mlflow.keras

model_path = "models/keras_games_v1"
mlflow.keras.save_model(model, model_path)

loaded = mlflow.keras.load_model(model_path,
                                 custom_objects={'auc': auc})
loaded.evaluate(x, y, verbose=0)


2.3 Model Endpoints 

Now that we know how to set up a web service and load pre-trained
predictive models, we can set up a web service that provides a
prediction result in response to a passed-in instance. We’ll deploy
models for the games data set using scikit-learn and Keras.

2.3.1 Scikit-Learn 

To use scikit-learn to host a predictive model, we’ll modify our echo
service built with Flask. The main changes to make are loading
a scikit-learn model using MLflow, parsing out the feature vector
to pass to the model from the input parameters, and adding the
model result to the response payload. The updated Flask application
for using scikit-learn is shown in the following snippet:


import pandas as pd
from sklearn.linear_model import LogisticRegression
import mlflow
import mlflow.sklearn
import flask

model_path = "models/logit_games_v1"
model = mlflow.sklearn.load_model(model_path)

app = flask.Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def predict():
    data = {"success": False}
    params = flask.request.args

    if "G1" in params.keys():
        new_row = {"G1": params.get("G1"), "G2": params.get("G2"),
                   "G3": params.get("G3"), "G4": params.get("G4"),
                   "G5": params.get("G5"), "G6": params.get("G6"),
                   "G7": params.get("G7"), "G8": params.get("G8"),
                   "G9": params.get("G9"), "G10": params.get("G10")}

        new_x = pd.DataFrame.from_dict(new_row,
                                       orient="index").transpose()

        data["response"] = str(model.predict_proba(new_x)[0][1])
        data["success"] = True

    return flask.jsonify(data)

if __name__ == '__main__':
    app.run(host='0.0.0.0')


After loading the required libraries, we use load_model to load the
scikit-learn model object using MLflow. In this setup, the model is
loaded only once, and will not be updated unless we relaunch the
application. The main change from the echo service is creating the
feature vector that we need to pass as input to the model’s prediction
functions. The new_row object creates a dictionary using the
passed-in parameters. To provide the Pandas row format needed
by scikit-learn, we can create a Pandas dataframe based on the
dictionary and then transpose the result, which creates a dataframe
with a single row. The resulting dataframe is then passed to
predict_proba to make a propensity prediction for the passed-in user.
The model output is added to the JSON payload under the response
key.
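
Values read from the query string arrive as strings, so it can help
to convert and validate them before building the dataframe. The
sketch below shows one way to do this in plain Python. The helper
name build_feature_row is hypothetical, and treating a missing
feature as 0 is an assumption of this sketch rather than behavior of
the application above.

```python
FEATURES = ["G%d" % i for i in range(1, 11)]  # G1 .. G10

def build_feature_row(params, default=0.0):
    """Convert request parameters (strings) to an ordered list of floats."""
    row = []
    for name in FEATURES:
        value = params.get(name, default)
        try:
            row.append(float(value))
        except (TypeError, ValueError):
            # Report which parameter could not be parsed.
            raise ValueError("parameter %s is not numeric: %r" % (name, value))
    return row

print(build_feature_row({"G1": "0", "G10": "1"}))
```

Keeping the feature order in a single list helps ensure that the
columns passed to the model match the order used during training.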

Similar to the echo service, we’ll need to save the app as a Python
file rather than running the code directly in Jupyter. I saved the
code as predict.py and launched the endpoint by running python3
predict.py, which runs the service on port 5000.

To test the service, we can use Python to pass in a record
representing an individual user. The dictionary defines the list of games
that the user has previously purchased, and the GET command
is used to call the service. For the example below, the response
key returned a value of 0.3812. If you are running this script in a
Jupyter notebook on an EC2 instance, you’ll need to enable remote
access for the machine on port 5000. Even though the web service
and notebook are running on the same machine, we are using the
public IP to reference the model endpoint.

import requests

new_row = {"G1": 0, "G2": 0, "G3": 0, "G4": 0, "G5": 0,
           "G6": 0, "G7": 0, "G8": 0, "G9": 0, "G10": 1}

result = requests.get("http://52.90.199.190:5000/", params=new_row)
print(result.json()['response'])


2.3.2 Keras 

The setup for Keras is similar to scikit-learn, but there are a few
additions that need to be made to handle the TensorFlow graph
context. We also need to redefine the auc function prior to loading
the model using MLflow. The snippet below shows the complete
code for a Flask app that serves a Keras model for the game
purchases data set.

The main thing to note in this script is the use of the graph object.
Because Flask uses multiple threads, we need to define the graph
used by Keras as a global object, and grab a reference to the graph
using the with statement when serving requests.


import pandas as pd
import mlflow
import mlflow.keras
import flask
import tensorflow as tf
import keras as k

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    k.backend.get_session().run(
        tf.local_variables_initializer())
    return auc

global graph
graph = tf.get_default_graph()
model_path = "models/keras_games_v1"
model = mlflow.keras.load_model(model_path,
                                custom_objects={'auc': auc})

app = flask.Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def predict():
    data = {"success": False}
    params = flask.request.args

    if "G1" in params.keys():
        new_row = {"G1": params.get("G1"), "G2": params.get("G2"),
                   "G3": params.get("G3"), "G4": params.get("G4"),
                   "G5": params.get("G5"), "G6": params.get("G6"),
                   "G7": params.get("G7"), "G8": params.get("G8"),
                   "G9": params.get("G9"), "G10": params.get("G10")}

        new_x = pd.DataFrame.from_dict(new_row,
                                       orient="index").transpose()

        with graph.as_default():
            data["response"] = str(model.predict(new_x)[0][0])
            data["success"] = True

    return flask.jsonify(data)

if __name__ == '__main__':
    app.run(host='0.0.0.0')


I saved the script as keras_predict.py and then launched the Flask
app using python3 keras_predict.py. The result is a Keras model
running as a web service on port 5000. To test the script, we can
run the same script from the previous section where we tested a
scikit-learn model.


2.4 Deploying a Web Endpoint

Flask is great for prototyping models as web services, but it’s not
intended to be used directly in a production environment. For
a proper deployment of a web application, you’ll want to use a
WSGI server, which provides scaling, routing, and load balancing.
If you’re looking to host a web service that needs to handle a large
workload, then Gunicorn provides a great solution. If instead you’d
like to use a hosted solution, then Heroku provides a platform for
hosting web services written in Python. Heroku is useful for hosting
a data science portfolio, but is limited in terms of components
when building data and model pipelines.

2.4.1 Gunicorn 

We can use Gunicorn to provide a WSGI server for our echo Flask
application. Using Gunicorn helps separate the functionality of an
application, which we implemented in Flask, from the deployment
of an application. Gunicorn is a lightweight WSGI implementation
that works well with Flask apps.

It’s straightforward to switch from using Flask directly to using
Gunicorn to run the web service. The new command for running
the application is shown below. Note that we are passing in a bind
parameter to enable remote connections to the service.

gunicorn --bind 0.0.0.0 echo:app

The result on the command line is shown below. The main difference
from before is that we now interface with the service on port
8000 rather than on port 5000. If you want to test out the service,
you’ll need to enable remote access on port 8000.

gunicorn --bind 0.0.0.0 echo:app
[INFO] Starting gunicorn 19.9.0
[INFO] Listening at: http://0.0.0.0:8000 (9509)
[INFO] Using worker: sync
[INFO] Booting worker with pid: 9512

To test the service using Python, we can run the following snippet.
You’ll need to make sure that access to port 8000 is enabled, as
discussed in Section 1.4.1.


result = requests.get("http://52.90.199.190:8000/",
                      params={'msg': 'Hello from Gunicorn'})
print(result.json())


The result is a JSON response with the passed-in message. The
main distinction from our prior setup is that we are now using
Gunicorn, which can use multiple worker processes or threads to
balance load, and can perform additional server configuration that is
not available when using only Flask. Configuring Gunicorn to serve
production workloads is outside the scope of this book, because
it is a self-hosted solution where a team needs to manage the DevOps
of the system. Instead, we’ll focus on managed solutions, including
AWS Lambda and Cloud Functions in Chapter 3, where minimal
overhead is needed to keep systems operational.

2.4.2 Heroku 

Now that we have a Gunicorn application, we can host it in the
cloud using Heroku. Python is one of the core languages supported
by this cloud environment. A key benefit of Heroku, at the time of
writing, is that you can host apps for free, which is great for
showcasing data science projects. The first step is to set up an
account on the web site: https://www.heroku.com/

Next, we’ll set up the command line tools for Heroku, by running
the commands shown below. There can be some complications
when setting up Heroku on an AMI EC2 instance, but downloading
and unzipping the binaries directly works around these problems.
The steps shown below download a release, extract it, and install
an additional dependency. The last step outputs the version of
Heroku installed. I got the following output: heroku/7.29.0
linux-x64 node-v11.14.0.


wget https://cli-assets.heroku.com/heroku-linux-x64.tar.gz
gunzip heroku-linux-x64.tar.gz
tar xf heroku-linux-x64.tar
sudo yum -y install glibc.i686
/home/ec2-user/heroku/bin/heroku --version


Once Heroku is installed, we need to set up a project that we
will deploy to. We can use the CLI to create a new Heroku
project by running the following commands:

/home/ec2-user/heroku/bin/heroku login
/home/ec2-user/heroku/bin/heroku create


This will create a unique app name, such as obscure-coast-69593. It’s
good to test the setup locally before deploying to production. In
order to test the setup, you’ll need to install the django and
django-heroku packages. Heroku has some dependencies on Postgres, which
is why additional install and easy_install commands are included
when installing these libraries.


pip install --user django
sudo yum install gcc python-setuptools postgresql-devel
sudo easy_install psycopg2
pip install --user django-heroku


To get started with building a Heroku application, we’ll first
download the sample application from GitHub and then modify the
project to include our echo application.


sudo yum install git
git clone https://github.com/heroku/python-getting-started.git
cd python-getting-started

Next, we’ll make our changes to the project. We copy our echo.py
file into the directory, add Flask to the list of dependencies in the
requirements.txt file, override the command to run in the Procfile,
and then call heroku local to test the configuration locally.


cp ../echo.py echo.py
echo 'flask' >> requirements.txt
echo "web: gunicorn echo:app" > Procfile
/home/ec2-user/heroku/bin/heroku local


You should see a result that looks like this:

/home/ec2-user/heroku/bin/heroku local
[OKAY] Loaded ENV .env File as KEY=VALUE Format
[INFO] Starting gunicorn 19.9.0
[INFO] Listening at: http://0.0.0.0:5000 (10485)
[INFO] Using worker: sync
[INFO] Booting worker with pid: 10488

As before, we can test the endpoint using a browser or a Python 
call, as shown below. In the Heroku local test configuration, port 
5000 is used by default. 

result = requests.get("http://localhost:5000/",
                      params={'msg': 'Hello from Heroku Local'})
print(result.json())


The final step is to deploy the service to production. The git
commands are used to push the results to Heroku, which automatically
releases a new version of the application. The last command tells
Heroku to scale up to a single worker, which is free.


git add echo.py
git commit .
git push heroku master
/home/ec2-user/heroku/bin/heroku ps:scale web=1

After these steps run, there should be a message that the application
has been deployed to Heroku. Now we can call the endpoint,
which has a proper URL, is secured, and can be used to publicly
share data science projects.


result = requests.get("https://obscure-coast-69593.herokuapp.com",
                      params={'msg': 'Hello from Heroku Prod'})
print(result.json())


There are many languages and tools supported by Heroku, and it’s
useful for hosting small-scale data science projects.


2.5 Interactive Web Services 

While the standard deployment of a model as a web service is an
API that you can call programmatically, it’s often useful to expose
models as interactive web applications. For example, we might
want to build an application where there is a UI for specifying
different inputs to a model, and the UI reacts to changes made by
the user. While Flask can be used to build web pages that react
to user input, there are libraries built on top of Flask that provide
higher-level abstractions for building web applications with the
Python language.

2.5.1 Dash 

Dash is a Python library written by the Plotly team that enables
building interactive web applications with Python. You specify an
application layout and a set of callbacks that respond to user input.
If you’ve used Shiny in the past, Dash shares many similarities,
but is built on Python rather than R. With Dash, you can create
simple applications as we’ll show here, or complex dashboards that
interact with machine learning models.

We’ll create a simple Dash application that provides a UI for
interacting with a model. The application layout will contain three
text boxes, where two of these are for user inputs and the third one
shows the output of the model. We’ll create a file called dash_app.py
and start by specifying the libraries to import.


import dash 

import dash_html_components as html 
import dash_core_components as dcc 
from dash.dependencies import Input, Output 
import pandas as pd 
import mlflow.sklearn 


Next, we’ll define the layout of our application. We create a Dash
object and then set the layout field to include a title and three
text boxes with labels. We’ll include only 2 of the 10 games from
the games data set, to keep the sample short. The last step in
the script launches the web service and enables connections from
remote machines.

app = dash.Dash(__name__)

app.layout = html.Div(children=[
    html.H1(children='Model UI'),
    html.P([
        html.Label('Game 1 '),
        dcc.Input(value='1', type='text', id='g1'),
    ]),
    html.Div([
        html.Label('Game 2 '),
        dcc.Input(value='0', type='text', id='g2'),
    ]),
    html.P([
        html.Label('Prediction '),
        dcc.Input(value='0', type='text', id='pred')
    ]),
])

if __name__ == '__main__':
    app.run_server(host='0.0.0.0')

FIGURE 2.2: The initial Dash application.


Before writing the callbacks, we can test out the layout of the
application by running python3 dash_app.py, which will run on port
8050 by default. You can browse to your public IP on port 8050
to see the resulting application. The initial application layout is
shown in Figure 2.2. Before any callbacks are added, the result of
the Prediction text box will always be 0.

The next step is to add a callback to the application so that the
Prediction text box is updated whenever the user changes one of
the Game 1 or Game 2 values. To perform this task, we define
a callback shown in the snippet below. The callback is defined
after the application layout, but before the run_server command.
We also load the logistic regression model for the games data set
using MLflow. The callback uses a decorator to define the inputs
to the function, the output, and any additional state that needs
to be provided. The way that the decorator is defined here, the
function will be called whenever the value of Game 1 or Game 2 is
modified by the user, and the value returned by this function will
be set as the value of the Prediction text box.

model_path = "models/logit_games_v1"
model = mlflow.sklearn.load_model(model_path)

@app.callback(
    Output(component_id='pred', component_property='value'),
    [Input(component_id='g1', component_property='value'),
     Input(component_id='g2', component_property='value')]
)
def update_prediction(game1, game2):
    new_row = {"G1": float(game1),
               "G2": float(game2),
               "G3": 0, "G4": 0,
               "G5": 0, "G6": 0,
               "G7": 0, "G8": 0,
               "G9": 0, "G10": 0}

    new_x = pd.DataFrame.from_dict(new_row,
                                   orient="index").transpose()
    return str(model.predict_proba(new_x)[0][1])


The function takes the two values provided by the user, and creates 
a Pandas dataframe. As before, we transpose the dataframe to 
provide a single row that we’ll pass as input to the loaded model. 
The value predicted by the model is then returned and set as the 
value of the Prediction text box. 
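
Because the callback fires on every change to a text box, it can
receive an empty or partially typed value, and calling float on such
a value raises an exception. The sketch below shows one way to guard
the conversion. The helper name parse_game_value and the choice of 0
as the fallback are assumptions made for illustration.

```python
def parse_game_value(text, default=0.0):
    """Convert text-box content to a float, falling back to default."""
    try:
        return float(text)
    except (TypeError, ValueError):
        # Empty strings, None, and non-numeric text all end up here.
        return default

print(parse_game_value("1"))
print(parse_game_value(""))
print(parse_game_value(None))
```

Inside the callback, float(game1) and float(game2) could be replaced
with calls to a helper like this so that the prediction keeps
updating while the user is still typing.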

The updated application with the callback function included
is shown in Figure 2.3. The prediction value now dynamically
changes in response to changes in the other text fields, and provides
a way of introspecting the model.

Dash is great for building web applications, because it eliminates
the need to write JavaScript code. It’s also possible to stylize Dash
applications using CSS to add some polish to your tools.

FIGURE 2.3: The resulting model prediction.


2.6 Conclusion 

The Python ecosystem has a great suite of tools for building web
applications. Using only Python, you can write scalable APIs
deployed to the open web or custom UI applications that interact
with backend Python code. This chapter focused on Flask, which
can be extended with other libraries and hosted in a wide range
of environments. One of the important concepts we touched on
in this chapter is model persistence, which will be useful in other
contexts when building scalable model pipelines. We also deployed
a simple application to Heroku, which is a separate cloud platform
from AWS and GCP.

This chapter is only an introduction to the many different web
tools within the Python ecosystem, and the topic of scaling these
types of tools is outside the scope of this book. Instead, we’ll focus
on managed solutions for models on the web, which significantly
reduce the DevOps overhead of deploying models as web services.
The next chapter will cover two systems for serverless functions in
managed environments.


Serverless technologies enable developers to write and deploy code
without needing to worry about provisioning and maintaining
servers. One of the most common uses of this technology is serverless
functions, which make it much easier to author code that can
scale to match variable workloads. With serverless function
environments, you write a function that the runtime supports, specify
a list of dependencies, and then deploy the function to production.
The cloud platform is responsible for provisioning servers, scaling
up more machines to match demand, managing load balancers, and
handling versioning. Since we’ve already explored hosting models
as web endpoints, serverless functions are an excellent tool to
utilize when you want to rapidly move from prototype to production
for your predictive models.

Serverless functions were first introduced on AWS in 2015 and
GCP in 2016. Both of these systems provide a variety of triggers
that can invoke functions, and a number of outputs that the
functions can trigger in response. While it’s possible to use serverless
functions to avoid writing complex code for gluing different
components together in a cloud platform, we’ll explore a much
narrower use case in this chapter. We’ll write serverless functions that
are triggered by an HTTP request, calculate a propensity score for
the passed-in feature vector, and return the prediction as JSON.
For this specific use case, GCP’s Cloud Functions are much easier
to get up and running, but we’ll explore both AWS and GCP
solutions.
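The request and response contract described above can be sketched
without any cloud platform. The snippet below is a minimal
illustration under stated assumptions, not the implementation used
later in this chapter: the score function and its weights are
hypothetical stand-ins for a trained model, and handle is an assumed
name for the body of the serverless function.

```python
import json
import math

# Hypothetical weights standing in for a trained model (illustration only).
WEIGHTS = {"G1": 0.8, "G2": -0.3}
BIAS = -1.0

def score(features):
    """Return a propensity score in (0, 1) using a logistic function."""
    z = BIAS + sum(WEIGHTS.get(k, 0.0) * float(v) for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def handle(request_body):
    """Parse a JSON feature vector and return the prediction as JSON."""
    data = {"success": False}
    try:
        features = json.loads(request_body)
        if isinstance(features, dict) and features:
            data["response"] = str(score(features))
            data["success"] = True
    except (TypeError, ValueError):
        pass  # malformed JSON or non-numeric feature values
    return json.dumps(data)

print(handle('{"G1": "1", "G2": "0"}'))
```

Returning the score as a string mirrors the responses shown later in
this chapter, where the client is expected to convert the value back
to a number.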

In this chapter, we’ll introduce the concept of managed services,
where the cloud platform is responsible for provisioning servers.
Next, we’ll cover hosting sklearn and Keras models with Cloud
Functions. To conclude, we’ll show how to achieve the same
result for sklearn models with Lambda functions in AWS. We’ll also
touch on model updates and access control.


3.1 Managed Services 

Since 2015, there’s been a movement in cloud computing to
transition developers away from manually provisioning servers to using
managed services that abstract away the concept of servers. The
main benefit of this new paradigm is that developers can write code
in a staging environment and then push code to production with
minimal concerns about operational overhead, and the
infrastructure required to match the required workload can be automatically
scaled as needed. This enables both engineers and data scientists
to be more active in DevOps, because many of the operational
concerns of the infrastructure are managed by the cloud provider.

Manually provisioning servers, where you ssh into the machines to
set up libraries and code, is often referred to as hosted deployments,
versus managed solutions where the cloud platform is responsible
for abstracting away this concern from the user. In this book, we’ll
cover examples in both of these categories. Here are some of the
different use cases we’ll cover:

• Web Endpoints: Single EC2 instance (hosted) vs AWS 
Lambda (managed). 

• Docker: Single EC2 instance (hosted) vs ECS (managed). 

• Messaging: Kafka (hosted) vs PubSub (managed). 

This chapter will walk through the first use case, migrating web
endpoints from a single machine to an elastic environment. We’ll
also work through examples that thread this distinction, such as
deploying Spark environments with specific machine configurations
and manual cluster management.

Serverless technologies and managed services are a powerful tool
for data scientists, because they enable a single developer to build
data pipelines that can scale to massive workloads. There are,
however, a few trade-offs to consider when using managed services.
Here are some of the main issues to consider when deciding between
hosted and managed solutions:

• Iteration: Are you rapidly prototyping on a product or iterating
on a system in production?

• Latency: Is a multi-second latency acceptable for your SLAs?

• Scale: Can your system scale to match peak workload demands?

• Cost: Are you willing to pay more for serverless cloud costs?

At a startup, serverless technologies are great because you have
low-volume traffic and have the ability to quickly iterate and try
out new architectures. At a certain scale, the dynamics change
and the cost of using serverless technologies may be less appealing
when you already have in-house expertise for provisioning cloud
services. In my past projects, the top concern was latency, because
it can impact customer experiences. In Chapter 8, we’ll touch on
this topic, because managed solutions often do not scale well to
large streaming workloads.

Even if your organization does not use managed services in daily
operations, it’s a useful skill set to get hands on with as a data
scientist, because it means that you can separate model training
from model deployment issues. One of the themes in this book is
that models do not need to be complex, but it can be complex
to deploy models. Serverless functions are a great approach for
demonstrating the ability to serve models at scale, and we’ll walk
through two cloud platforms that provide this capability.


3.2 Cloud Functions (GCP) 

Google Cloud Platform provides an environment for serverless
functions called Cloud Functions. The general concept with this
tool is that you can write code targeted for Flask, but leverage the
managed services in GCP to provide elastic computing for your
Python code. GCP is a great environment to get started with
serverless functions, because it closely matches standard Python
development ecosystems, where you specify a requirements file and
application code.

We’ll build scalable endpoints that serve both sklearn and Keras
models with Cloud Functions. There are a few issues to be aware
of when writing functions in this environment:

• Storage: Cloud Functions run in a read-only environment, but
you can write to the /tmp directory.

• Tabs: Spaces versus tabs can cause issues in Cloud Functions,
and if you are working in the web editor versus familiar tools like
Sublime Text, these can be difficult to spot.

• sklearn: When using a requirements file, it’s important to
differentiate between sklearn and scikit-learn based on your imports.
We’ll use sklearn in this chapter.
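Since mixed indentation is hard to spot in a web editor, one option
is to check the source locally before pasting it in. The helper below
is a small sketch of that idea; the function name and the sample
source string are illustrative assumptions, not part of any GCP
tooling.

```python
def find_tab_indented_lines(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad

# Sample source with one tab-indented line and one space-indented line
sample = "def echo(request):\n\treturn 'tab'\n    return 'spaces'\n"
print(find_tab_indented_lines(sample))
```

Running a check like this against main.py before deploying can save
a round trip through the deployment logs.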

Cloud platforms are always changing, so the specific steps outlined
in this chapter may change based on the evolution of these
platforms, but the general approach for deploying functions should
apply throughout these updates. As always, the approach I
advocate for is starting with a simple example, and then scaling to
more complex solutions as needed. In this section, we’ll first build
an echo service and then explore sklearn and Keras models.

3.2.1 Echo Service 

GCP provides a web interface for authoring Cloud Functions. This
UI provides options for setting up the triggers for a function,
specifying the requirements file for a Python function, and authoring
the implementation of the Flask function that serves the request.
To start, we’ll set up a simple echo service that reads in a
parameter from an HTTP request and returns the passed-in parameter
as the result.

In GCP, you can directly set up a Cloud Function as an HTTP
endpoint without needing to configure additional triggers. To get
started with setting up an echo service, perform the following
actions in the GCP console:

1. Search for “Cloud Function”

2. Click on “Create Function”

3. Select “HTTP” as the trigger

4. Select “Allow unauthenticated invocations”

5. Select “Inline Editor” for source code

6. Select Python 3.7 as the runtime

FIGURE 3.1: Creating a Cloud Function.

An example of this process is shown in Figure 3.1. After
performing these steps, the UI will provide tabs for the main.py and
requirements.txt files. The requirements file is where we will specify
libraries, such as flask >= 1.1.1, and the main file is where we’ll
implement our function behavior.

We’ll start by creating a simple echo service that parses out the msg
parameter from the passed-in request and returns this parameter
as a JSON response. In order to use the jsonify function we need
to include the flask library in the requirements file. The
requirements.txt and main.py files for the simple echo service are shown
in the snippet below. The echo function here is similar to the echo
service we coded in Section 2.1.1; the main distinction is that
we are no longer using decorators to specify the endpoints and
allowed methods. Instead, these settings are now specified
using the Cloud Functions UI.

# requirements.txt
flask

# main.py
def echo(request):
    from flask import jsonify

    data = {"success": False}
    params = request.get_json()

    if "msg" in params:
        data["response"] = str(params['msg'])
        data["success"] = True

    return jsonify(data)
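Before deploying, the function logic can be exercised locally. The
sketch below is a variant written for illustration rather than the
deployed code: FakeRequest is a hypothetical stand-in for the Flask
request object, and echo_payload returns a plain dictionary so that
it runs without Flask installed (a deployed function would pass this
dictionary to jsonify). It also guards against a request that has no
JSON body, assuming get_json is called with silent=True so that it
returns None instead of raising.

```python
class FakeRequest:
    """Hypothetical stand-in for flask.Request, for local testing only."""
    def __init__(self, payload):
        self._payload = payload

    def get_json(self, silent=False):
        return self._payload

def echo_payload(request):
    """Build the echo response as a plain dict."""
    data = {"success": False}
    # Fall back to an empty dict when the request carries no JSON body
    params = request.get_json(silent=True) or {}
    if "msg" in params:
        data["response"] = str(params["msg"])
        data["success"] = True
    return data

print(echo_payload(FakeRequest({"msg": "Hello World!"})))
print(echo_payload(FakeRequest(None)))
```

Separating the payload logic from the web framework in this way
makes it easier to unit test a function before pushing it to a
managed environment.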


We can deploy the function to production by performing the
following steps:

1. Update “Function to execute” to “echo”

2. Click “Create” to deploy

Once the function has been deployed, you can click on the “Testing”
tab to check if the deployment of the function worked as intended.
You can specify a JSON object to pass to the function, and invoke
the function by clicking “Test the function”, as shown in Figure 3.2.
The result of running this test case is the JSON object returned
in the Output dialog, which shows that invoking the echo function
worked correctly.

FIGURE 3.2: Testing a Cloud Function.


Now that the function is deployed and we enabled unauthenticated
access to the function, we can call the function over the web using
Python. To get the URL of the function, click on the “Trigger”
tab. We can use the requests library to pass a JSON object to the
serverless function, as shown in the snippet below.


import requests

result = requests.post(
    "https://us-central1-gameanalytics.cloudfunctions.net/echo",
    json = {'msg': 'Hello from Cloud Function'})
print(result.json())


The result of running this script is that a JSON payload is returned
from the serverless function. The output from the call is the JSON
shown below.










{
    'response': 'Hello from Cloud Function',
    'success': True
}

We now have a serverless function that provides an echo service. In
order to serve a model using Cloud Functions, we’ll need to persist
the model specification somewhere that the serverless function can
access. To accomplish this, we’ll use Cloud Storage to store the
model in a distributed storage layer.

3.2.2 Cloud Storage (GCS) 

GCP provides an elastic storage layer called Google Cloud Storage
(GCS) that can be used for distributed file storage and can also
scale to other uses such as data lakes. In this section, we’ll explore
the first use case of utilizing this service to store and retrieve files
for use in a serverless function. GCS is similar to AWS’s offering
called S3, which is leveraged extensively in the gaming industry to
build data platforms.

While GCP does provide a UI for interacting with GCS, we’ll
explore the command line interface in this section, since this
approach is useful for building automated workflows. GCP requires
authentication for interacting with this service; please revisit
Section 1.5.1 if you have not yet set up a JSON credentials file. In
order to interact with Cloud Storage using Python, we’ll also need
to install the GCS library, using the command shown below:


pip install --user google-cloud-storage
export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json


Now that we have the prerequisite libraries installed and
credentials set up, we can interact with GCS programmatically using
Python. Before we can store a file, we need to set up a bucket on
GCS. A bucket is a prefix assigned to all files stored on GCS, and
each bucket name must be globally unique. We’ll create a bucket
called dsp_model_store where we’ll store model objects. The
script below shows how to create a new bucket using the
create_bucket function and then iterate through all of the available
buckets using the list_buckets function. You’ll need to change the
bucket_name variable to something unique before running this script.


from google.cloud import storage
bucket_name = "dsp_model_store"

storage_client = storage.Client()
storage_client.create_bucket(bucket_name)

for bucket in storage_client.list_buckets():
    print(bucket.name)

After running this code, the output of the script should be a
single bucket, with the name assigned to the bucket_name variable.
We now have a path on GCS that we can use for saving files:

gs://dsp_model_store
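Because bucket names must be globally unique, one option is to
append a random suffix to a readable prefix. The helper below is a
sketch of that idea; the function name is hypothetical, and the
validation pattern covers only an assumed subset of the GCS naming
rules, so the official naming guidelines remain the reference.

```python
import re
import uuid

def unique_bucket_name(prefix):
    """Append a random suffix so the bucket name is unlikely to collide."""
    name = f"{prefix}_{uuid.uuid4().hex[:8]}"
    # Assumed subset of the GCS naming rules: 3-63 characters, lowercase
    # letters, digits, dashes, and underscores, starting and ending with
    # a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name

print(unique_bucket_name("dsp_model_store"))
```

The generated value could then be assigned to the bucket_name
variable used in the scripts in this section.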

We’ll reuse the model we trained in Section 2.2.1 to deploy a
logistic regression model with Cloud Functions. To save the file to
GCS, we need to assign a path to the destination, shown by the
bucket.blob command below, and select a local file to upload, which
is passed to the upload function.


from google.cloud import storage

bucket_name = "dsp_model_store"
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

blob = bucket.blob("serverless/logit/v1")
blob.upload_from_filename("logit.pkl")


After running this script, the contents of the local file logit.pkl
will be available on GCS at the following location, which is the
bucket name followed by the blob path:

gs://dsp_model_store/serverless/logit/v1


While it’s possible to use URIs such as this directly to access files,
as we’ll explore with Spark in Chapter 6, in this section we’ll
retrieve the file using the bucket name and blob path. The code
snippet below shows how to download the model file from GCS to
local storage. We download the model file to the local path of
local_logit.pkl and then load the model by calling pickle.load with
this path.


import pickle
from google.cloud import storage

bucket_name = "dsp_model_store"
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

blob = bucket.blob("serverless/logit/v1")
blob.download_to_filename("local_logit.pkl")
model = pickle.load(open("local_logit.pkl", 'rb'))
model
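The scripts above address objects by bucket name and blob path,
while other tools expect a single gs:// URI. A small helper can
convert between the two forms. This is a sketch with a hypothetical
function name, shown here with the bucket and blob path from the
scripts in this section.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into (bucket, blob_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("expected a gs:// URI")
    # Everything up to the first slash is the bucket, the rest is the blob
    bucket, _, blob_path = uri[len(prefix):].partition("/")
    return bucket, blob_path

print(split_gcs_uri("gs://dsp_model_store/serverless/logit/v1"))
```

The two returned values map onto the arguments passed to get_bucket
and bucket.blob respectively.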


We can now programmatically store model files to GCS using
Python and also retrieve them, enabling us to load model files
in Cloud Functions. We’ll combine this with the Flask examples
from the previous chapter to serve sklearn and Keras models as
Cloud Functions.

3.2.3 Model Function 

We can now set up a Cloud Function that serves logistic regression
model predictions over the web. We’ll build on the Flask example
that we explored in Section 2.3.1 and make a few modifications
for the service to run on GCP. The first step is to specify the
required Python libraries that we’ll need to serve requests in the
requirements.txt file, as shown below. We’ll also need Pandas to
set up a dataframe for making the prediction, sklearn for applying
the model, and cloud storage for retrieving the model object from
GCS.


google-cloud-storage
sklearn
pandas
flask

The next step is to implement our model function in the main.py file.
A small change from before is that the params object is now fetched
using request.get_json() rather than flask.request.args. The main
change is that we are now downloading the model file from GCS
rather than retrieving the file directly from local storage, because
local files are not available when writing Cloud Functions with the
UI tool. An additional change from the prior function is that we
are now reloading the model for every request, rather than loading
the model file once at startup. In a later code snippet, we’ll show
how to use global objects to cache the loaded model.


def pred(request):
    from google.cloud import storage
    import pickle as pk
    import sklearn
    import pandas as pd
    from flask import jsonify

    data = {"success": False}
    params = request.get_json()

    if "G1" in params:

        new_row = { "G1": params.get("G1"), "G2": params.get("G2"),
                    "G3": params.get("G3"), "G4": params.get("G4"),
                    "G5": params.get("G5"), "G6": params.get("G6"),
                    "G7": params.get("G7"), "G8": params.get("G8"),
                    "G9": params.get("G9"), "G10": params.get("G10") }

        new_x = pd.DataFrame.from_dict(new_row,
                                       orient = "index").transpose()

        # set up access to the GCS bucket
        bucket_name = "dsp_model_store"
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucket_name)

        # download and load the model
        blob = bucket.blob("serverless/logit/v1")
        blob.download_to_filename("/tmp/local_logit.pkl")
        model = pk.load(open("/tmp/local_logit.pkl", 'rb'))

        data["response"] = str(model.predict_proba(new_x)[0][1])
        data["success"] = True

    return jsonify(data)


One note in the code snippet above is that the /tmp directory is
used to store the downloaded model file. In Cloud Functions, you
are unable to write to the local disk, with the exception of this
directory. Generally it’s best to read objects directly into memory
rather than pulling objects to local storage, but this example keeps
the file-based download used in the previous section.
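As an alternative sketch, recent releases of the google-cloud-storage
library expose a download_as_bytes method on blobs, which would
allow the model to be unpickled without touching disk. The example
below assumes that method is available in the installed version;
FakeBlob is a hypothetical stand-in so that the snippet runs without
GCS credentials, and a dictionary stands in for a trained model.

```python
import pickle

class FakeBlob:
    """Hypothetical stand-in for a GCS blob, for local testing only."""
    def __init__(self, payload):
        self._payload = payload

    def download_as_bytes(self):
        return self._payload

def load_model_in_memory(blob):
    """Unpickle an object directly from the blob's bytes."""
    return pickle.loads(blob.download_as_bytes())

# A dict stands in for a trained model here.
blob = FakeBlob(pickle.dumps({"coef": [0.1, 0.2]}))
print(load_model_in_memory(blob))
```

As with any pickle-based workflow, this should only be used with
model files from a bucket that you control, since unpickling can
execute arbitrary code.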

For this function, we created a new Cloud Function named pred,
set the function to execute to pred, and deployed the function to
production. We can now call the function from Python, using the
same approach as in Section 2.3.1 with a URL that now points to
the Cloud Function, as shown below:


import requests

result = requests.post(
    "https://us-central1-gameanalytics.cloudfunctions.net/pred",
    json = {'G1':'1', 'G2':'0', 'G3':'0', 'G4':'0', 'G5':'0',
            'G6':'0', 'G7':'0', 'G8':'0', 'G9':'0', 'G10':'0'})
print(result.json())


The result of the Python web request to the function is a JSON
response with a response value and model prediction, shown below:

{
    'response': '0.06745113592634559',
    'success': True
}
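On the client side, the feature payload and the returned score can
be handled with two small helpers. This is a sketch under the
assumption that the endpoint expects the ten keys G1 through G10
and returns the prediction as a string, as in the output above; the
helper names are hypothetical.

```python
def build_payload(values):
    """Map ten feature values onto the keys G1..G10 as strings."""
    if len(values) != 10:
        raise ValueError("expected 10 feature values")
    return {f"G{i}": str(v) for i, v in enumerate(values, start=1)}

def extract_score(response_json):
    """Return the prediction as a float, or None if the call failed."""
    if not response_json.get("success"):
        return None
    return float(response_json["response"])

payload = build_payload([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(payload)
print(extract_score({"response": "0.06745113592634559", "success": True}))
```

The payload dictionary can be passed as the json argument of
requests.post, and extract_score can be applied to the parsed JSON
of the response.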

In order to improve the performance of the function, so that it
takes milliseconds to respond rather than seconds, we’ll need to
cache the model object between runs. It’s best to avoid defining
variables outside of the scope of the function, because the server
hosting the function may be terminated due to inactivity. Global
variables are an exception to this rule when used for caching
objects between function invocations. The code snippet below shows
how a global model object can be declared within the scope of the
pred function to provide a persistent object across calls. During
the first function invocation, the model file will be retrieved from
GCS and loaded via pickle. During subsequent runs, the model
object will already be loaded into memory, providing a much faster
response time.


model = None

def pred(request):
    global model

    if not model:
        # download model from GCS
        model = pk.load(open("/tmp/local_logit.pkl", 'rb'))

    # apply model

    return jsonify(data)


Caching objects is important for authoring responsive models that
lazily load objects as needed. It’s also useful for more complex
models, such as Keras, which requires persisting a TensorFlow graph
between invocations.
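The caching pattern can be demonstrated end to end without any
cloud dependencies. In the sketch below, load_model is a
hypothetical loader standing in for the GCS download and unpickle
step, and load_count exists only to show how often the expensive
step runs.

```python
import time

model = None        # cached across invocations on a warm instance
load_count = 0      # illustration only: counts how often we "download"

def load_model():
    """Hypothetical loader standing in for the GCS download and unpickle."""
    global load_count
    load_count += 1
    time.sleep(0.05)             # simulate a slow download
    return lambda row: 0.5       # dummy model returning a fixed score

def pred(features):
    global model
    if model is None:            # only the first call pays the load cost
        model = load_model()
    return {"response": str(model(features)), "success": True}

for _ in range(3):
    pred({"G1": 1})
print("loads:", load_count)
```

This sketch checks for None explicitly rather than relying on the
truthiness of the model object, which avoids surprises with objects
that define their own boolean behavior.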

3.2.4 Keras Model 

Since Cloud Functions provide a requirements file that can be used 
to add additional dependencies to a function, it’s also possible to 
serve Keras models with this approach. We’ll be able to reuse most 
of the code from the past section, and we’ll also use the Keras 
and Flask approach introduced in Section 2.3.2. Given the size of 
the Keras libraries and dependencies, we’ll need to upgrade the
memory available for the function from 256 MB to 1 GB. We also
need to update the requirements file to include Keras:


google-cloud-storage
tensorflow
keras
pandas
flask

The full implementation for the Keras model as a Cloud Function
is shown in the code snippet below. In order to make sure that the
TensorFlow graph used to load the model is available for future
invocations of the model, we use global variables to cache both
the model and graph objects. To load the Keras model, we need
to redefine the auc function that was used during model training,
which we include within the scope of the predict function. We
reuse the same approach from the prior section to download the
model file from GCS, but now use load_model from Keras to read
the model file into memory from the temporary disk location. The
result is a Keras predictive model that lazily fetches the model file
and can scale to meet variable workloads as a serverless function.


model = None
graph = None

def predict(request):
    global model
    global graph

    from google.cloud import storage
    import pandas as pd
    import flask
    import tensorflow as tf
    import keras as k
    from keras.models import load_model
    from flask import jsonify

    # note: the graph and session calls below target the TensorFlow 1.x API
    def auc(y_true, y_pred):
        auc = tf.metrics.auc(y_true, y_pred)[1]
        k.backend.get_session().run(
            tf.local_variables_initializer())
        return auc

    data = {"success": False}
    params = request.get_json()

    # download model if not cached
    if not model:
        graph = tf.get_default_graph()

        bucket_name = "dsp_model_store_1"
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucket_name)

        blob = bucket.blob("serverless/keras/v1")
        blob.download_to_filename("/tmp/games.h5")
        model = load_model('/tmp/games.h5',
                           custom_objects={'auc': auc})

    # apply the model
    if "G1" in params:
        new_row = { "G1": params.get("G1"), "G2": params.get("G2"),
                    "G3": params.get("G3"), "G4": params.get("G4"),
                    "G5": params.get("G5"), "G6": params.get("G6"),
                    "G7": params.get("G7"), "G8": params.get("G8"),
                    "G9": params.get("G9"), "G10": params.get("G10") }

        new_x = pd.DataFrame.from_dict(new_row,
                                       orient = "index").transpose()

        with graph.as_default():
            data["response"] = str(model.predict_proba(new_x)[0][0])
            data["success"] = True

    return jsonify(data)


To test the deployed model, we can reuse the Python web request
script from the prior section and replace pred with predict in the
request URL. We have now deployed a deep learning model to
production.

3.2.5 Access Control 

The Cloud Functions we introduced in this chapter are open to
the web, which means that anyone can access them and
potentially abuse the endpoints. In general, it’s best not to enable
unauthenticated access and instead lock down the functions so that only
authenticated users and services can access them. This
recommendation also applies to the Flask apps that we deployed in the last
chapter, where it’s a best practice to restrict access to services that
can reach the endpoint using AWS private IPs.

There are a few different approaches for locking down Cloud
Functions to ensure that only authenticated users have access to the
functions. The easiest approach is to disable “Allow
unauthenticated invocations” in the function setup to prevent hosting the
function on the open web. To use the function, you’ll need to set
up IAM roles and credentials for the function. This process involves
a number of steps and may change over time as GCP evolves.
Instead of walking through this process, it’s best to refer to the GCP
documentation (see footnote 1).

Another approach for setting up functions that enforce
authentication is by using other services within GCP. We’ll explore this
approach in Chapter 8, which introduces GCP’s PubSub system
for producing and consuming messages within GCP’s ecosystem.
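For reference, a request to a function that requires authentication
typically carries an identity token in the Authorization header. The
sketch below only builds such a request using the standard library
and does not send it. Obtaining the token is outside its scope, the
token value shown is a placeholder, and the helper name is
hypothetical; the GCP documentation remains the reference for the
exact setup.

```python
import json
import urllib.request

def build_authenticated_post(url, payload, token):
    """Create (but do not send) a JSON POST carrying a bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        })

req = build_authenticated_post(
    "https://us-central1-gameanalytics.cloudfunctions.net/pred",
    {"G1": "1"},
    token="PLACEHOLDER_ID_TOKEN")
print(req.get_method(), req.get_header("Authorization"))
```

A request built this way could be sent with urllib.request.urlopen
once a valid token has been substituted for the placeholder.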

3.2.6 Model Refreshes 

We’ve deployed sklearn and Keras models to production using
Cloud Functions, but the current implementations of these
functions use static model files that will not change over time. It’s
usually necessary to make changes to models over time to ensure
that the accuracy of the models does not drift too far from expected
performance. There are a few different approaches that we can take
to update the model specification that a Cloud Function is using:

1. Redeploy: Overwriting the model file on GCS and
redeploying the function will result in the function loading the
updated file.

2. Timeout: We can add a timeout to the function, where
the model is re-downloaded after a certain threshold of
time passes, such as 30 minutes.

3. New Function: We can deploy a new function, such as
pred_v2, and update the URL used by systems calling the
service, or use a load balancer to automate this process.

4. Model Trigger: We can add additional triggers to the
function to force the function to manually reload the
model.


1 https://cloud.google.com/functions/docs/securing/authenticating

While the first approach is the easiest to implement and can work
well for small-scale deployments, the third approach, where a load
balancer is used to direct calls to the newest function available,
is probably the most robust approach for production systems. A
best practice is to add logging to your function, in order to track
predictions over time so that you can monitor the performance of
the model and identify potential drift.


3.3 Lambda Functions (AWS) 

AWS also provides an ecosystem for serverless functions called
Lambda. AWS Lambda is useful for gluing different components
within an AWS deployment together, since it supports a rich set
of triggers for function inputs and outputs. While Lambda does
provide a powerful tool for building data pipelines, the current
Python development environment is a bit clunkier than GCP.

In this section we’ll walk through setting up an echo service and
an sklearn model endpoint with Lambda. We won’t cover Keras,
because the size of the library causes problems when deploying a
function with AWS. Unlike the past section where we used a UI
to define functions, we’ll use command line tools for providing our
function definition to Lambda.

3.3.1 Echo Function 

For a simple function, you can use the inline code editor that 
Lambda provides for authoring functions. You can create a new 
function by performing the following steps in the AWS console: 

1. Under “Find Services”, select “Lambda”

2. Select “Create Function”

3. Use “Author from scratch”

4. Assign a name (e.g. echo)

5. Select a Python runtime

6. Click “Create Function”

After running these steps, Lambda will generate a file called
lambda_function.py. The file defines a function called lambda_handler,
which we’ll use to implement the echo service. We’ll make a small
modification to the file, as shown below, which echoes the msg
parameter as the body of the response object.


def lambda_handler(event, context):

    return {
        'statusCode': 200,
        'body': event['msg']
    }

Click “Save” to deploy the function and then “Test” to test
it. If you use the default test parameters, then an error will be
returned when running the function, because no msg key is available
in the event object. Click on “Configure test event”, and use
the following configuration:

{
    "msg": "Hello from Lambda!"
}

After clicking on “Test”, you should see the execution results. The
response should be the echoed message with a status code of 200
returned. There are also details about how long the function took to
execute (25.8 ms), the billing duration (100 ms), and the maximum
memory used (56 MB).
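The handler can also be invoked locally as an ordinary Python
function, which makes it easy to see the effect of a missing msg
key. The variant below is a sketch rather than the code deployed
above: returning a 400 status for a missing parameter is an assumed
design choice, and the second event is simply an example without a
msg key.

```python
def lambda_handler(event, context):
    """Echo the msg field, or return a 400 when it is missing."""
    if "msg" not in event:
        return {"statusCode": 400, "body": "missing 'msg' parameter"}
    return {"statusCode": 200, "body": event["msg"]}

# Local invocation; the context argument is unused, so None is passed.
print(lambda_handler({"msg": "Hello from Lambda!"}, None))
print(lambda_handler({"key1": "value1"}, None))
```

Returning an explicit error response instead of raising keeps the
behavior predictable for callers once the function is placed behind
a web endpoint.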

We now have a simple function running on AWS Lambda. For this
function to be exposed to external systems, we’ll need to set up an
API Gateway, which is covered in Section 3.3.3. This function will
scale up to meet demand if needed, and requires no server
monitoring once deployed. To set up a function that deploys a model,
we’ll need to use a different workflow for authoring and publishing
the function, because AWS Lambda does not currently support a
requirements.txt file for defining dependencies when writing
functions with the inline code editor. To store the model file that we
want to serve with a Lambda function, we’ll use S3 as a storage
layer for model artifacts.

3.3.2 Simple Storage Service (S3) 

AWS provides a highly-performant storage layer called S3, which
can be used to host individual files for web sites, store large files
for data processing, and even host thousands or millions of files
for building data lakes. For now, our use case will be storing an
individual zip file, which we’ll use to deploy new Lambda functions.
However, there are many broader use cases and many companies
use S3 as their initial endpoint for data ingestion in data platforms.

In order to use S3 to store our function to deploy, we’ll need to
set up a new S3 bucket, define a policy for accessing the bucket,
and configure credentials for setting up command line access to S3.
Buckets on S3 are analogous to GCS buckets in GCP.

To set up a bucket, browse to the AWS console and select “S3”
under “Find Services”. Next, select “Create Bucket” to set up a location
for storing files on S3. Create a unique name for the S3 bucket, as
shown in Figure 3.3, and click “Next” and then “Create Bucket”
to finalize setting up the bucket.

We now have a location to store objects on S3, but we still need
to set up a user before we can use the command line tools to
write and read from the bucket. Browse to the AWS console and
select “IAM” under “Find Services”. Next, click “Users” and then
“Add user” to set up a new user. Create a user name, and select
“Programmatic access” as shown in Figure 3.4.

The next step is to provide the user with full access to S3. Use 
the attach existing policies option and searcli for S3 policies in 
order to hnd and select the AmazonS3FuiiAccess policy, as shown in 
Figure 3.5. Click “Next” to continue the process until a new user 
is dehned. At the end of tliis process, a set of credentials will be 
displayed, including an access key ID and secret access key. Store 
these values in a safe location. 

FIGURE 3.3: Creating an S3 bucket on AWS.

FIGURE 3.4: Setting up a user with S3 access.


The last step needed for setting up command line access to S3
is running the aws configure command from your EC2 instance.
You’ll be asked to provide the access and secret keys from the user
we just set up. In order to test that the credentials are properly
configured, you can run the following commands:


aws configure
aws s3 ls

FIGURE 3.5: Selecting a policy for full S3 access.


The results should include the name of the S3 bucket we set up
at the beginning of this section. Now that we have an S3 bucket
set up with command line access, we can begin writing Lambda
functions that use additional libraries such as Pandas and sklearn.

3.3.3 Model Function 

In order to author a Lambda function that uses libraries outside of
the base Python distribution, you’ll need to set up a local environment
that defines the function and includes all of the dependencies.
Once your function is defined, you can upload the function by creating
a zip file of the local environment, uploading the resulting file
to S3, and configuring a Lambda function from the file uploaded
to S3.

The first step in this process is to create a directory with all of the
dependencies installed locally. While it’s possible to perform this
process on a local machine, I used an EC2 instance to provide a
clean Python environment. The next step is to install the libraries
needed for the function, which are Pandas and sklearn. These libraries
are already installed on the EC2 instance, but need to be
reinstalled in the current directory in order to be included in the
zip file that we’ll upload to S3. To accomplish this, we can append
-t . to the end of the pip command in order to install the libraries
into the current directory. The last steps to run on the command
line are copying our logistic regression model into the current directory,
and creating a new file that will implement the Lambda
function.


mkdir lambda 
cd lambda 

pip install pandas -t . 
pip install scikit-learn -t .
cp ../logit.pkl logit.pkl 
vi logit.py 


The full source code for the Lambda function that serves our logistic
regression model is shown in the code snippet below. The
structure of the file should look familiar: we first globally define a
model object and then implement a function that services model
requests. This function first parses the request to extract the inputs
to the model, and then calls predict_proba on the resulting
dataframe to get a model prediction. The result is then returned
as a dictionary object containing a body key. It’s important to define
the function response within the body key, otherwise Lambda
will throw an exception when invoking the function over the web.


# newer scikit-learn releases removed sklearn.externals.joblib;
# use "import joblib" there instead
from sklearn.externals import joblib
import pandas as pd
import json

# load the model once, outside of the handler
model = joblib.load('logit.pkl')

def lambda_handler(event, context):

    # read in the request body as the event dict
    if "body" in event:
        event = event["body"]

        if event is not None:
            event = json.loads(event)
        else:
            event = {}

    if "G1" in event:
        new_row = { "G1": event["G1"], "G2": event["G2"],
                    "G3": event["G3"], "G4": event["G4"],
                    "G5": event["G5"], "G6": event["G6"],
                    "G7": event["G7"], "G8": event["G8"],
                    "G9": event["G9"], "G10": event["G10"] }

        new_x = pd.DataFrame.from_dict(new_row,
                                       orient = "index").transpose()
        prediction = str(model.predict_proba(new_x)[0][1])

        return { "body": "Prediction " + prediction }

    return { "body": "No parameters" }

Unlike Cloud Functions, Lambda functions authored in Python
are not built on top of the Flask library. Instead of requiring a
single parameter (request), a Lambda function requires event and
context objects to be passed in as function parameters. The event
includes the parameters of the request, and the context provides information
about the execution environment of the function. When
testing a Lambda function using the “Test” functionality in the
Lambda console, the test configuration is passed directly to the
function as a dictionary in the event object. However, when the
function is called from the web, the event object is a dictionary
that describes the web request, and the request parameters are
stored in the body key in this dict. The first step in the Lambda
function above checks if the function is being called directly from
the console, or via the web. If the function is being called from
the web, then the function overrides the event dictionary with the
content in the body of the request.
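
The console-versus-web check described above can be isolated into a small helper. This is a minimal sketch rather than the book's code: the name parse_event is hypothetical, and it assumes the web request body arrives as a JSON string (or None when no payload is sent).

```python
import json

def parse_event(event):
    """Return the request parameters as a dict for console or web calls."""
    # web requests wrap the parameters in a "body" key
    if "body" in event:
        body = event["body"]
        # the body is a JSON string, or None when no payload was sent
        return json.loads(body) if body is not None else {}
    # console tests pass the parameters directly as the event dict
    return event

console_event = {"G1": "1", "G2": "0"}
web_event = {"body": json.dumps(console_event)}

# both invocation styles yield the same parameter dictionary
print(parse_event(console_event) == parse_event(web_event))
```

Keeping this step in its own function makes it possible to check both invocation paths locally before deploying.
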
One of the main differences between this approach and the GCP
Cloud Function is that we did not need to explicitly define global
variables that are lazily defined. With Lambda functions, you can
define variables outside the scope of the function that are persisted
before the function is invoked. It’s important to load model objects
outside of the model service function, because reloading the model
each time a request is made can become expensive when handling
large workloads.
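
The effect of loading the model at module scope can be illustrated without Lambda or scikit-learn. In this sketch, load_model is a hypothetical stand-in for joblib.load that records how often it runs, and the loop imitates several invocations against the same loaded module.

```python
load_calls = []

def load_model():
    """Stand-in for joblib.load('logit.pkl'); records each call."""
    load_calls.append(1)
    return "logit-model"

# module scope: executed once, when the module is first loaded
model = load_model()

def lambda_handler(event, context):
    # each invocation reuses the model object loaded above
    return {"body": "served by " + model}

# imitate three requests handled by the same loaded module
for _ in range(3):
    lambda_handler({}, None)

print(len(load_calls))
```

The load count stays at one no matter how many times the handler runs, which is the behavior the handler in this section relies on.
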

To deploy the model, we need to create a zip file of the current
directory, and upload the file to a location on S3. The snippet
below shows how to perform these steps and then confirm that
the upload succeeded using the s3 ls command. You’ll need to
modify the paths to use the S3 bucket name that you defined in
the previous section.


zip -r logitFunction.zip .
aws s3 cp logitFunction.zip s3://dsp-ch3-logit/logitFunction.zip
aws s3 ls s3://dsp-ch3-logit/
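
If the zip utility is not available, the packaging step can be approximated with Python's standard library. This is a sketch under stated assumptions: zip_directory is a hypothetical helper, and it only mirrors the archive layout of zip -r, with paths stored relative to the source directory. The upload to S3 is not covered here.

```python
import os
import zipfile

def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into zip_path with relative paths."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # skip the archive itself if it lives inside source_dir
                if os.path.abspath(full_path) == zip_abs:
                    continue
                rel_path = os.path.relpath(full_path, source_dir)
                archive.write(full_path, rel_path)
    return zip_path
```

Calling zip_directory(".", "logitFunction.zip") from inside the lambda directory would produce an archive with logit.py and logit.pkl at the top level, which is where the handler setting expects to find them.
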


Once your function is uploaded as a zip file to S3, you can return
to the AWS console and set up a new Lambda function. Select
“Author from scratch” as before, and under “Code entry type” select
the option to upload from S3, specifying the location from the
cp command above. You’ll also need to define the Handler, which is
a combination of the Python file name and the Lambda function
name. An example configuration for the logit function is shown in
Figure 3.6.
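
The handler naming convention, file name plus function name, can be sketched in a few lines. This illustrates the convention only and is not how the Lambda runtime is implemented. resolve_handler is a hypothetical helper, demonstrated on a standard-library module because logit.py is not importable here.

```python
import importlib

def resolve_handler(handler):
    """Split 'module.function' and return the function object it names."""
    module_name, function_name = handler.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)

# "logit.lambda_handler" splits into the file logit.py and lambda_handler
print("logit.lambda_handler".rsplit(".", 1))

# the same lookup applied to a module that exists everywhere
dumps = resolve_handler("json.dumps")
print(dumps({"ok": True}))
```
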

FIGURE 3.6: Defining the logit function on AWS Lambda.

Make sure to select the Python runtime as the same version of
Python that was used to run the pip commands on the EC2 instance.
Once the function is deployed by pressing “Save”, we can
test the function using the following definition for the test event.

{
  "G1": "1", "G2": "1", "G3": "1",
  "G4": "1", "G5": "1",
  "G6": "1", "G7": "1", "G8": "1",
  "G9": "1", "G10": "1"
}
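
Rather than typing the ten keys by hand, the same test event can be generated and pasted into the console. A small sketch, assuming only that the model inputs are named G1 through G10 as in the handler:

```python
import json

# one entry per model input, G1 through G10, each set to the string "1"
test_event = {"G" + str(i): "1" for i in range(1, 11)}

print(json.dumps(test_event, indent=2))
```
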

Since the model is loaded when the function is deployed, the response
time for testing the function should be relatively fast. An
example output of testing the function is shown in Figure 3.7. The
output of the function is a dictionary that includes a body key and
the output of the model as the value. The function took 110 ms
to execute and was billed for a duration of 200 ms.
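
The gap between the two durations is consistent with rounding up to a fixed billing increment. The sketch below assumes a 100 ms increment, which is inferred from the two numbers reported here rather than taken from AWS documentation. billed_duration_ms is a hypothetical helper.

```python
import math

def billed_duration_ms(duration_ms, increment_ms=100):
    """Round a duration up to the next billing increment (assumed 100 ms)."""
    return math.ceil(duration_ms / increment_ms) * increment_ms

# the execution above ran for 110.66 ms
print(billed_duration_ms(110.66))  # 200
```
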

So far, we’ve invoked the function only using the built-in test functionality
of Lambda. In order to host the function so that other
services can interact with the function, we’ll need to define an API
Gateway. Under the “Designer” tab, click “Add Trigger” and select
“API Gateway”. Next, select “Create a new API” and choose
“Open” as the security setting. After setting up the trigger, an
API Gateway should be visible in the Designer layout, as shown
in Figure 3.8.

FIGURE 3.7: Testing the logit function on AWS Lambda.

FIGURE 3.8: Setting up an API Gateway for the function.

Before calling the function from Python code, we can use the API
Gateway testing functionality to make sure that the function is
set up properly. One of the challenges I ran into when testing this
Lambda function was that the structure of the request varies when
the function is invoked from the web versus the console. This is
why the function first checks if the event object is a web request
or dictionary with parameters. When you use the API Gateway to
test the function, the resulting call will emulate calling the function
as a web request. An example test of the logit function is shown
in Figure 3.9.

FIGURE 3.9: Testing post commands on the Lambda function.

Now that the gateway is set up, we can call the function from a
remote host using Python. The code snippet below shows how to
use a POST command to call the function and display the result.
Since the function returns a string for the response, we use the
text attribute rather than the json function to display the result.


import requests

result = requests.post(
    "https://3z5btf0ucb.execute-api.us-east-1."
    "amazonaws.com/default/logit",
    json = { 'G1':'1', 'G2':'0', 'G3':'0', 'G4':'0', 'G5':'0',
             'G6':'0', 'G7':'0', 'G8':'0', 'G9':'0', 'G10':'0' })

print(result.text)
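
Because the response body is plain text of the form "Prediction <value>", a caller that needs the number has to strip the prefix itself. A minimal sketch, where parse_prediction is a hypothetical helper and the response string is the one shown in Figure 3.7:

```python
def parse_prediction(text):
    """Extract the probability from a 'Prediction <value>' response string."""
    prefix = "Prediction "
    if not text.startswith(prefix):
        # e.g. the "No parameters" response returned by the handler
        raise ValueError("unexpected response: " + text)
    return float(text[len(prefix):])

probability = parse_prediction("Prediction 0.39762327867823355")
print(probability > 0.5)
```

In a real client the argument would be result.text from the request above. Returning JSON from the function instead would remove the need for this parsing step, at the cost of changing the handler.
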
We now have a predictive model deployed to AWS Lambda that
will autoscale as necessary to match workloads, and which requires
minimal overhead to maintain.

Similar to Cloud Functions, there are a few different approaches
that can be used to update the deployed models. However, for
the approach we used in this section, updating the model requires
updating the model file in the development environment, rebuilding
the zip file and uploading it to S3, and then deploying a new
version of the model. This is a manual process and if you expect
frequent model updates, then it’s better to rewrite the function so
that it fetches the model definition from S3 directly rather than
expecting the file to already be available in the local context. The
most scalable approach is setting up additional triggers for the
function, to notify the function that it’s time to load a new model.
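
One way to picture the "fetch the model from S3" variant is a small cache that reloads only when the stored version changes. This is a sketch under stated assumptions, not the book's implementation: ModelCache and both callables are hypothetical, and an in-memory dictionary stands in for the S3 bucket so the example runs without AWS access.

```python
class ModelCache:
    """Keep one model in memory and reload it only when its version changes."""

    def __init__(self, fetch_version, fetch_model):
        self.fetch_version = fetch_version  # hypothetical: e.g. an object version id
        self.fetch_model = fetch_model      # hypothetical: download and deserialize
        self.version = None
        self.model = None
        self.reloads = 0

    def get(self):
        current = self.fetch_version()
        if current != self.version:
            self.model = self.fetch_model()
            self.version = current
            self.reloads += 1
        return self.model

# an in-memory stand-in for the model file stored on S3
store = {"version": "v1", "model": "logit-v1"}
cache = ModelCache(lambda: store["version"], lambda: store["model"])

print(cache.get())   # first call loads the model
print(cache.get())   # second call reuses it
store.update(version="v2", model="logit-v2")
print(cache.get())   # version changed, so the model is reloaded
```

The handler would call cache.get() instead of referring to a global model, trading a cheap version check per request for not having to redeploy the zip file on every model update.
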


3.4 Conclusion 

Serverless functions are a type of managed service that enables
developers to deploy production-scale systems without needing to
worry about infrastructure. To provide this abstraction, different
cloud platforms do place constraints on how functions must be
implemented, but the trade-off is generally worth the improvement
in DevOps that these tools enable. While serverless technologies
like Cloud Functions and Lambda can be operationally expensive,
they provide flexibility that can offset these costs.

In this chapter, we implemented echo services and sklearn model
endpoints using both GCP’s Cloud Functions and AWS’s Lambda
offerings. With AWS, we created a local Python environment with
all dependencies and then uploaded the resulting files to S3 to
deploy functions, while in GCP we authored functions directly
using the inline code editor. The best system to use will likely
depend on which cloud provider your organization is already using,
but when prototyping new systems, it’s useful to have hands-on
experience using more than one serverless function ecosystem.
Containers for Reproducible Models


When deploying data science models, it’s important to be able to
reproduce the same environment used for training and the environment
used for serving. In Chapter 2, we used the same machine for
both environments, and in Chapter 3 we used a requirements.txt
file to ensure that the serverless ecosystem used for model serving
matched our development environment. Container systems such as
Docker provide a tool for building reproducible environments, and
they are much lighter weight than alternative approaches such as
virtual machines.

The idea of a container is that it is an isolated environment in
which you can set up the dependencies that you need in order to
perform a task. The task can be performing ETL work, serving
ML models, standing up APIs, or hosting interactive web applications.
The goal of a container framework is to provide isolation
between instances with a lightweight footprint. With a container
framework, you specify the dependencies that your code needs, and
let the framework handle the legwork of managing different execution
environments. Docker is the de facto standard for containers,
and there is substantial tooling built on this platform.

Elastic container environments, such as Elastic Container Service
(ECS), provide similar functionality to serverless functions, where
you want to abstract away the notion of servers from hosting data
science models. The key differentiation is that serverless ecosystems
are restricted to specific runtimes, often have memory limitations
that make it challenging to use deep learning frameworks,
and are cloud specific. With ECS, you are responsible for setting
up the types of instances used to serve models, you can use whatever
languages are needed to serve the model, and you can take up
as much memory as needed. ECS still has the problem of being a
proprietary AWS tool, but newer options such as EKS build upon
Kubernetes, which is open source and portable.

Here are some of the data science use cases I’ve seen for containers:

• Reproducible Analyses: Containers provide a great way of
packaging up analyses, so that other team members can rerun
your work months or years later.

• Web Applications: In Chapter 2 we built an interactive web
application with Dash. Containers provide a great way of abstracting
away hosting concerns for deploying the app.

• Model Deployments: If you want to expose your model as an
endpoint, containers provide a great way of separating the model
application code from model serving infrastructure.

The focus of this chapter will be the last use case. We’ll take our
web endpoint from Chapter 2 and wrap the application in a Docker
container. We’ll start by running the container locally on an EC2
instance, and then explore using ECS to create a scalable, load-balanced,
and fault-tolerant deployment of our model. We’ll then
show how to achieve a similar result on GCP using Kubernetes.

Now that we are exploring scalable compute environments, it’s 
important to keep an eye on cloud costs when using ECS and GKE. 
For AWS, it’s useful to keep an eye on how many EC2 instances are 
provisioned, and on GCP the billing tool provides good tracking 
of costs. The section on orchestration is specific to AWS and uses 
an approach that is not portable to different cloud environments. 
Feel free to skip directly to the section on Kubernetes if AWS is 
not a suitable environment for your model deployments. 


4.1 Docker 

Docker, and other platform-as-a-service tools, provide a virtualization
concept called containers. Containers run on top of a host
operating system, but provide a standardized environment for code
running within the container. One of the key goals of this virtualization
approach is that you can write code for a target environment,
and any system running Docker can run your container.

Containers are a lightweight alternative to virtual machines, which
provide similar functionality. The key difference is that containers
are much faster to spin up, while providing the same level of isolation
as virtual machines. Another benefit is that containers can
re-use layers from other containers, making it much faster to build
and share containers. Containers are a great solution to use when
you need to run conflicting versions of Python runtimes or libraries
on a single machine.

With Docker, you author a file called a Dockerfile that is used to
define the dependencies for a container. The result of building the
Dockerfile is a Docker image, which packages all of the runtimes,
libraries, and code needed to run an app. A Docker container
is an instantiated image that is running an application. One of
the useful features in Docker is that new images can build off
of existing images. For our model deployment, we’ll extend the
ubuntu:latest image.

This section will show how to set up Docker on an EC2 instance,
author a Dockerfile for building an image of the echo service from
Chapter 2, build an image using Docker, and run a container. To
install Docker on an EC2 instance, you can use the amazon-linux-extras
tool to simplify the process. The commands below will install
Docker, start the service on the EC2 instance, and list the
running containers, which will return an empty list.

sudo yum install -y python3-pip python3 python3-setuptools 
sudo yum update -y 

sudo amazon-linux-extras install docker 
sudo service docker start
sudo docker ps 

The application we’ll deploy is the echo service from Chapter 2.
This service is a Flask application that parses the msg attribute
from a GET or POST and returns a JSON payload echoing the
provided message. The only difference from the prior application
is that the Flask app now runs on port 80, shown by the last line
in the echo.py snippet below.

# load Flask
import flask

app = flask.Flask(__name__)

# define a predict function as an endpoint
@app.route("/predict", methods=["GET","POST"])
def predict():
    data = {"success": False}

    # get the request parameters; silent=True returns None instead of
    # raising an error when the request body is not JSON
    params = flask.request.get_json(silent=True)
    if params is None:
        params = flask.request.args

    # if parameters are found, echo the msg parameter
    if params is not None:
        data["response"] = params.get("msg")
        data["success"] = True

    # return a response in json format
    return flask.jsonify(data)

# start the flask app, allow remote connections
app.run(host='0.0.0.0', port = 80)


Now that we have Docker installed and an application that we want
to containerize, we need to write a Dockerfile that describes how
to build an image. A Dockerfile for performing this task is shown
in the snippet below. The first step is to use the from command
to identify a base image to use. The ubuntu image provides a linux
environment that supports the apt-get command. The maintainer
command adds to the metadata information associated with the
image, adding the name of the image maintainer. Next, the run
command is used to install Python, set up a symbolic link, and
install Flask. For containers with many Python libraries, it’s also
possible to use a requirements.txt file. The copy command inserts
our script into the image and places the file in the root directory.
The final command specifies the arguments to run to execute the
application.

FROM ubuntu:latest
MAINTAINER Ben Weber

RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && cd /usr/local/bin \
  && ln -s /usr/bin/python3 python \
  && pip3 install flask

COPY echo.py echo.py

ENTRYPOINT ["python3","echo.py"]

After writing a Dockerfile, you can use the build command that
Docker provides to create an image. The first command shown
in the snippet below shows how to build an image, tagged as
echo_service, using the file ./Dockerfile. The second command
shows the list of Docker images available on the instance. The
output will show both the ubuntu image we used as the base for
our image, and our newly created image.

sudo docker image build -t "echo_service" . 
sudo docker images 

To run an image as a container, we can use the run command
shown in the snippet below. The -d flag specifies that the container
should run as a daemon process, which will continue to run even
when shutting down the terminal. The -p flag is used to map a
port on the host machine to a port that the container uses for
communication. Without this setting, our container is unable to
receive external connections. The ps command shows the list of
running containers, which should now include the echo service.

sudo docker run -d -p 80:80 echo_service 
sudo docker ps 


To test the container, we can use the same process as before where
we use the external IP of the EC2 instance in a web browser and
pass a msg parameter to the /predict endpoint. Since we set up
a port mapping from the host port of 80 to the container port
80, we can directly invoke the container over the open web. An
example invocation and result from the echo service container is
shown below.


http://34.237.242.46/predict?msg=Hi_from_docker
{"response":"Hi_from_docker","success":true}
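
The same check can be scripted rather than typed into a browser. The sketch below only builds the request URL and parses a response of the shape shown above; it does not contact the service. predict_url is a hypothetical helper, and the IP address is the example value from this section.

```python
import json
from urllib.parse import urlencode

def predict_url(host, msg):
    """Build the GET URL for the /predict endpoint served on port 80."""
    return "http://" + host + "/predict?" + urlencode({"msg": msg})

print(predict_url("34.237.242.46", "Hi_from_docker"))

# parse a JSON payload of the form returned by the echo service
payload = json.loads('{"response":"Hi_from_docker","success":true}')
print(payload["response"], payload["success"])
```

Using urlencode rather than string concatenation means that messages containing spaces or other reserved characters are escaped correctly.
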


We have now walked through the process of building a Docker
image and running the image as a container on an EC2 instance.
While this approach does provide a solution for isolating different
services on a machine, it does not provide scaling and fault-tolerance,
which are typically requirements for a production-grade
model deployment.


4.2 Orchestration 

Container orchestration systems are responsible for managing the
life cycles of containers in a cluster. They provide services including
provisioning, scaling, failover, load balancing, and service discovery
between containers. AWS has multiple orchestration solutions, but
the general trend has been moving towards Kubernetes for this
functionality, which is an open-source platform originally designed
by Google.

One of the main reasons for using container orchestration as a data
scientist is to be able to deploy models as containers, where you
can scale up infrastructure to match demand, have a fault-tolerant
system that can recover from errors, and have a static URL for
your service managed by a load balancer. It’s a bit of work to
get one of these solutions up and running, but the end result is
a robust model deployment. Serverless functions provide a similar
capability, but using orchestration solutions means that you can
use whatever programming language and runtime is necessary to
serve your models. It provides flexibility at the cost of operational
overhead, but as these tools evolve the overhead should be reduced.

The best solution to use for orchestration depends on your use case
and if you have constraints on the cloud platform that you can use.
If your goal is to use Kubernetes, then GCP provides a fully managed
solution out of the box, which we’ll cover in the next section.
If you are restricted to AWS, then the two options are Elastic
Container Service (ECS) and Elastic Kubernetes Service (EKS).
If you’re just getting started with Kubernetes, then AWS provides
fewer tools for getting containers up and running through a web console.
Within ECS there are options for manually specifying EC2
instance types to run, or using the new Fargate mode to abstract
away managing EC2 instances. My general recommendation is to
learn Kubernetes if you want to get started with orchestration. But
if you need to run a containerized model at scale in AWS, then
ECS is currently the path of least friction.

In this section, we’ll walk through deploying our model image to
the AWS Container Registry, show how to set up a task in an ECS
cluster to run the image as a container, and then set up a load-balanced
service that provides managed provisioning for the echo
service.

4.2.1 AWS Container Registry (ECR) 

In order to use your Docker image in an orchestration system, you
need to push your image to a Docker registry that works with
the platform. For ECS, the AWS implementation of this service
is called AWS Elastic Container Registry (ECR). ECR is a managed
Docker registry that you can use to store and manage images
within the AWS ecosystem. It works well with both ECS and EKS.

FIGURE 4.1: A model repository on ECR.

The goal of this subsection is to walk through the process of getting 
a Docker image from an EC2 instance to ECR. We’ll cover the 
following steps: 

1. Setting up an ECR repository 

2. Creating an IAM role for ECR 

3. Using docker login 

4. Tagging an image 

5. Pushing an image 

The first step is to create a repository for the image that we want
to store on ECR. A registry can have multiple repositories, and
each repository can have multiple tagged images. To set up a new
repository, perform the following steps from the AWS console:

1. Search for and select “ECR” 

2. On the left panel, click on “Repositories” 

3. Select “Create Repository” 

4. Assign a repository name, e.g. “models” 

5. Click on “Create Repository” 

After completing these steps, you should have a new repository for
saving images on ECR, as shown in Figure 4.1. The repository will
initially be empty until we push a container.

Since our goal is to push a container from an EC2 instance to
ECR, we’ll need to set up permissions for pushing to the registry
from the command line. To set up these permissions, we can add
additional policies to the s3_lambda user that we first created in
Chapter 3. Perform the following steps from the AWS console:

1. Search for and select “IAM”

2. Click on “Users”

3. Select the “s3_lambda” user

4. Click “Add permissions”

5. Select “Attach existing policies directly”

6. Choose “AmazonEC2ContainerRegistryFullAccess”

7. Click on “Add permissions”

We now have permissions to write to ECR from the user account 
we previously set up. However, before we can run the docker login 
command for pushing images to ECR, we need to set up temporary 
credentials for accessing ECR. We can create a token by running 
the following command on the EC2 instance: 


aws ecr get-login --region us-east-1 --no-include-email


The output of running this task is a command to run from the
command line that will enable temporary access for storing images
to ECR. For this to work on the EC2 instance, you’ll need to
prepend sudo to the output from the previous step and then run
the generated command, as shown below.

sudo docker login -u AWS -p eylwYXlsb2FkIjoiVy9vYWVrYnY0YlVqTFp... 

If the authentication is successful, the output from this command
should be Login Succeeded. Now that we have successfully used the
docker login command, we can push to the ECR repository.
Before we can push our local image, we need to tag the image with
account specific parameters. The snippet below shows how to tag
the echo service with an AWS account ID and region, with a tag
of echo. For these parameters, region is a string such as us-east-1,
and the account ID is the number specified under Account id in the
“My Account” tab from the AWS console.


sudo docker tag echo_service \
  [account_id].dkr.ecr.[region].amazonaws.com/models:echo

FIGURE 4.2: The echo image in the ECR repository.


After tagging your image, it’s good to check that the outcome
matches the expected behavior. To check the tags of your images,
run sudo docker images from the command line. An example output
is shown below, with my account ID and region omitted.


REPOSITORY             TAG      IMAGE ID       CREATED       SIZE
amazonaws.com/models   echo     3380f2a8805b   3 hours ago   473MB
echo_service           latest   3380f2a8805b   3 hours ago   473MB
ubuntu                 latest   cf0f3ca922e0   8 days ago    64.2MB


The final step is to push the tagged image to the ECR repository. 
We can accomplish this by running the command shown below: 

sudo docker push [account_id].dkr.ecr.[region].amazonaws.com/models:echo
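
Since the tag and push commands must use exactly the same image name, it can help to build the string once. A minimal sketch of the naming scheme used in this section: ecr_image_uri is a hypothetical helper, and the account ID below is a made-up placeholder.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Compose the ECR image name used by docker tag and docker push."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# placeholder account ID; substitute the value from "My Account"
uri = ecr_image_uri("123456789012", "us-east-1", "models", "echo")
print("sudo docker tag echo_service " + uri)
print("sudo docker push " + uri)
```
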

After running this command, the echo service should now be available
in the model repository on ECR. To check if the process succeeded,
return to the AWS console and click on “Images” for the
model repository. The repo should now show an image with the
tag models:echo, as shown in Figure 4.2.

The outcome of this process is that we now have a Docker image
pushed to ECR that can be leveraged by an orchestration system.
While we did walk through a number of AWS specific steps, the
process of using Docker login applies to other cloud platforms.

4.2.2 AWS Container Service (ECS) 

AWS provides an elastic container service called ECS that is a good
platform for getting started with container management. While it
is a proprietary technology, it does introduce many concepts that
are useful to consider in an orchestrated cloud deployment. Here
are some concepts exposed through an ECS deployment:

• Cluster: A cluster defines the environment used to provision
and host containers.

• Task: A task is an instantiation of a container that performs a
specific workload.

• Service: A service manages a task and provisions new machines
based on demand.

Now that we have an image hosted on ECR, the next step is to
use ECS to provide a scalable deployment of this image. In this
section, we’ll walk through the following steps:

1. Setting up a cluster 

2. Setting up a task 

3. Running a task 

4. Running a Service 

At the end of this section, we’ll have a service that manages a task
running the echo service, but we’ll connect directly to the IP of
the provisioned EC2 instance. In the next section, we’ll set up a
load balancer to provide a static URL for accessing the service.

The first step in using ECS is to set up a cluster. There is a newer
feature in ECS called Fargate that abstracts away the notion of
EC2 instances when running a cluster, but this mode does not
currently support the networking modes that we need for connecting
directly to the container. To set up an ECS cluster, perform the
following steps from the AWS console:

FIGURE 4.3: An empty ECS cluster.

1. Search for and select “ECS” 

2. Click on “Clusters” on the left 

3. Select “Create Cluster” 

4. Use the “EC2 Linux + Networking” option 

5. Assign a name, “dsp” 

6. Select an instance type, m3.medium 

7. For “VPC”, select an existing VPC 

8. For “IAM Role”, use “aws-elasticbeanstalk-ec2-role” 

9. Click “Create” to start the cluster 

We now have an empty cluster with no tasks or services defined,
as shown in Figure 4.3. While the cluster has not yet spun up any
EC2 instances, there is an hourly billing amount associated with
running the cluster, and any EC2 instances that do spin up will
not be eligible for the free-tier option.

The next step is to define a task, which specifies the image to
use and a number of different settings. A task can be executed
directly, or managed via a service. Tasks are independent of services
and a single task can be used across multiple services if necessary.
To set up a task, we’ll first specify the execution environment by
performing the following steps:

1. From the ECS console, click “Task Definitions”

2. Select “Create a new Task Definition” 

3. Select EC2 launch type 

4. Assign a name, “echo_task” 

5. For “Task Role”, select “ecsTaskExecutionRole” 

6. For “Network Mode”, use “Bridge” 

7. Select “ecsTaskExecutionRole” for Task execution role 

8. Set “Task Memory” and “Task CPU” to 1024 

We then need to specify details about the Docker image that we 
want to host when running the task: 

1. Click “Add Container” 

2. Assign a name “echo_service” 

3. For “Image”, use the URI shown in Figure 4.2 

4. Add a port mapping of host:80, container:80 

5. Click “Add” to finalize Container setup 

6. Click “Create” to define the task 
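
The console steps above produce a task definition that ECS stores as
JSON. As a rough sketch, the same settings can be written down as a
Python dictionary and serialized, which is handy for keeping the
configuration under version control. The function name
echo_task_definition is hypothetical, the image URI is a placeholder,
and the role settings are left out because their ARNs are specific to
your account.

```python
import json


def echo_task_definition(image_uri: str) -> dict:
    """Mirror the console settings from this section as a task definition."""
    return {
        "family": "echo_task",
        "networkMode": "bridge",
        "requiresCompatibilities": ["EC2"],
        "cpu": "1024",
        "memory": "1024",
        "containerDefinitions": [
            {
                "name": "echo_service",
                "image": image_uri,
                "essential": True,
                # map port 80 on the EC2 host to port 80 in the container
                "portMappings": [
                    {"hostPort": 80, "containerPort": 80, "protocol": "tcp"}
                ],
            }
        ],
    }


# placeholder image name; use the URI from your own ECR repository
definition = echo_task_definition(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/models:echo")
print(json.dumps(definition, indent=2))
```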

We now have a task set up in ECS that we can use to host our
image as a container in the cloud. It is a good practice to test out
your tasks in ECS before defining a service to manage the task.
We can test out the task by performing the following steps:

1. From the ECS console, click on your cluster 

2. Click on the “Tasks” tab 

3. Select “Run new Task” 

4. Use EC2 as the “Launch type” 

5. For “Task Definition”, use “echo_task:1”

6. Click “Run Task” 

The result of setting up and running the echo service as a task
is shown in Figure 4.4. The ECS cluster is now running a single
task, which is hosting our echo service. To provision this task, the
ECS cluster will spin up a new EC2 instance. If you browse to
the EC2 console in AWS, you’ll see that a new EC2 instance has
been provisioned, and the name of the instance will be based on
the service, such as: ECS Instance - EC2ContainerService-echo.

FIGURE 4.4: The echo task running on the ECS cluster.

FIGURE 4.5: Network bindings for the echo service container.

Now that we have a container running in our ECS cluster, we can
query it over the open web. To find the URL of the service, click
on the running task and under containers, expand the echo service
details. The console will show an external IP address where the
container can be accessed, as shown in Figure 4.5. An example of
using the echo service is shown in the snippet below.


http://18.212.21.97/predict?msg=Hi_from_ECS

{"response":"Hi_from_ECS","success":true}

We now have a container running in the cloud, but it’s not scalable
and there is no fault tolerance. To enable these types of capabilities
we need to define a service in our ECS cluster that manages the
lifecycle of a task.

FIGURE 4.6: The ECS service running a single echo task.

To set up a service for the task, perform the following steps:

1. From the ECS console, click on your cluster 

2. Select the “Services” tab 

3. Click “Create” 

4. Use EC2 launch type 

5. Select “echo_task” as the task 

6. Assign a “Service name”, “echo_ecs” 

7. Use “None” for load balancer type 

8. Click “Create Service” 

This will start the service. The service will set the “Desired count”
value to 1, and it may take a few minutes for a new task to get
ramped up by the cluster. Once “Running count” is set to 1, you
can start using the service to host a model. An example of the
provisioned service is shown in Figure 4.6. To find the IP of the
container, click on the task within the service definition.

We now have a container that is managed by a service, but we’re
still accessing the container directly. In order to use ECS in a way
that scales, we need to set up a load balancer that provides a static
URL and routes requests to active tasks managed by the service.

FIGURE 4.7: The Application Load Balancer configuration.

4.2.3 Load Balancing 

There are a number of different load balancer options in AWS that
are useful for different deployments. This is also an area where the
options are rapidly changing. For our ECS cluster, we can use the
application load balancer to provide a static URL for accessing
the echo service. To set up a load balancer, perform the following
steps from the AWS console:

1. Search for and select “EC2” 

2. Select “Load Balancer” on the left 

3. Click “Create Load Balancer” 

4. Choose “Application Load Balancer” 

5. Assign a name, “model-service” 

6. Use the default VPC 

7. Create a new security group 

8. Set source to “Anywhere” 

9. Create a new target group, “model-group” 

10. Click “Create” 

An example of a provisioned application load balancer is shown in
Figure 4.7. The last step needed to set up a load-balanced container
is configuring an ECS service that uses the load balancer. Repeat
the same steps shown in the prior section for setting up a service,
but instead of selecting “None” for the load balancer type, perform
the following actions:

1. Select “Application” for load balancer type

2. Select the “model-service” balancer 

3. Select the “echo_service” Container 

4. Click “Add to Load Balancer” 

5. Select “model-group” as the target group 

6. Click “Create Service” 

It’s taken quite a few steps, but our echo service is now running in
a scalable environment, using a load balancer, and using a service
that will manage tasks to handle failures and provision new EC2
instances as necessary. This approach requires quite a bit more
configuration than Lambda for similar functionality, but it may be
preferred based on the type of workload that you need to handle.
There is a cost involved with running an ECS cluster, even if you
are not actively servicing requests, so understanding your expected
workload is useful when modeling out the cost of different model
serving options on AWS.


http://model123.us-east-1.elb.amazonaws.com/predict?msg=Hi_from_ELB

{"response":"Hi_from_ELB","success":true}

AWS does provide an option for Kubernetes called EKS, but the
options available through the web console are currently limited for
managing Docker images. EKS can work with ECR as well, and
as the AWS platform evolves EKS will likely be the best option
for new deployments.

Make sure to terminate your cluster, load balancers, and EC2
instances once you are done testing out your deployment to reduce
your cloud platform costs.

4.3 Kubernetes on GCP

Google Cloud Platform provides a service called Google Kubernetes
Engine (GKE) for serving Docker containers. Kubernetes is
a container-orchestration system originally developed by Google
that is now open source. There are a wide range of use cases for
this platform, but we’ll focus on the specific task of hosting our
echo service using managed Kubernetes.

Using Kubernetes for hosting a Docker container is similar to ECS,
where the first step is to save your image to a Docker registry that
can interface with the orchestration system. The GCP version of
this registry service is called Container Registry. To get our image
from an EC2 instance on AWS to the GCP Container Registry,
we’ll again use the docker login command. For this process to work,
you’ll need the GCP credentials json file that we set up in Chapter 1.
The code snippet below shows how to pass the json file to
the docker login command, tag the image for uploading it to the
registry, and push the image to Container Registry.


cat dsdemo.json | sudo docker login -u _json_key \
    --password-stdin https://us.gcr.io
sudo docker tag echo_service us.gcr.io/[gcp_account]/echo_service
sudo docker push us.gcr.io/[gcp_account]/echo_service


You’ll need to replace the gcp_account parameter in this script with
your full Google account ID. After performing these steps, the echo
service image should be visible under the Registry view in the GCP
console, as shown in Figure 4.8. Typically, if you are using GCP
for serving models, it’s likely that you’ll be using Google Compute
instances rather than EC2, but it’s good to get practice interfacing
between components in different cloud platforms.

The process for hosting a container with GKE is streamlined
compared to all of the steps needed for using ECS. We’ll first use the
GCP console to set up a container on Kubernetes, and then expose
the service to the open web.

FIGURE 4.8: The echo image on GCP Container Registry.

FIGURE 4.9: The echo image deployed via Kubernetes.

To deploy the echo service container,
perform the following steps from the GCP console:

1. Search for and select “Kubernetes Engine” 

2. Click “Deploy Container” 

3. Select “Existing Container Image” 

4. Choose “echo_service:latest”

5. Assign an application name “echo-gke”

6. Click “Deploy” 

We now have a Kubernetes cluster deployed and ready to serve the
echo service. The deployment of a Kubernetes cluster on GKE can
take a few minutes to set up. Once the deployment is completed,
you should see the echo cluster under the list of clusters, as shown
in Figure 4.9.

FIGURE 4.10: The echo service deployed to the open web.


To use the Service, we’ll need to expose the cluster to the open 
web by performing the following steps from the GCP console: 

1. From the GKE menu, select your cluster 

2. Click on “Workloads” 

3. Choose the “echo-gke” workload 

4. Select the “Actions” tab and then “Expose” 

5. For Service type, select “Load balancer”

After performing these steps, the cluster will configure an external
IP that can be used to invoke the service, as shown in Figure
4.10. GKE will automatically load balance and scale the service as
needed to match the workload.


http://35.238.43.63/predict?msg=Hi_from_GKE
{"response":"Hi_from_GKE","success":true}


An example of using the service is shown in the snippet above.
We were able to quickly take a Docker image and deploy it in a
Kubernetes ecosystem using GKE. It’s good to build experience
with Kubernetes for hosting Docker images, because it is a portable
solution that works across multiple cloud environments and it is
being adopted by many open-source projects.


4.4 Conclusion 

Containers are a great way to make sure that your analyses and
models are reproducible across different environments. While
containers are useful for keeping dependencies clean on a single
machine, the main benefit is that they enable data scientists to write
model endpoints without worrying about how the container will
be hosted. This separation of concerns makes it easier to partner
with engineering teams to deploy models to production, or using
the approaches shown in this chapter, data and applied science
teams can also own the deployment of models to production.

The best approach to use for serving models depends on your
deployment environment and expected workload. Typically, you
are constrained to a specific cloud platform when working at a
company, because your model service may need to interface with
other components in the cloud, such as a database or cloud
storage. Within AWS, there are multiple options for hosting
containers while GCP is aligned on GKE as a single solution. The main
question to ask is whether it is more cost effective to serve your
model using serverless function technologies or elastic container
technologies. The correct answer will depend on the volume of
traffic you need to handle, the amount of latency that is tolerable
for end users, and the complexity of models that you need to
host. Containerized solutions are great for serving complex models
and making sure that you can meet latency requirements, but may
require a bit more DevOps overhead versus serverless functions.

Model pipelines are usually part of a broader data platform that
provides data sources, such as lakes and warehouses, and data
stores, such as an application database. When building a pipeline,
it’s useful to be able to schedule a task to run, ensure that any
dependencies for the pipeline have already completed, and to backfill
historic data if needed. While it’s possible to perform these types
of tasks manually, there are a variety of tools that have been
developed to improve the management of data science workflows.

In this chapter, we’ll explore a batch model pipeline that
performs a sequence of tasks in order to train and store results for
a propensity model. This is a different type of task than the
deployments we’ve explored so far, which have focused on serving
real-time model predictions as a web endpoint. In a batch process,
you perform a set of operations that store model results that are
later served by a different application. For example, a batch model
pipeline may predict which users in a game are likely to churn, and
a game server fetches predictions for each user that starts a session
and provides personalized offers.

When building batch model pipelines for production systems, it’s
important to make sure that issues with the pipeline are quickly
resolved. For example, if the model pipeline is unable to fetch the
most recent data for a set of users due to an upstream failure
with a database, it’s useful to have a system in place that can
send alerts to the team that owns the pipeline and that can rerun
portions of the model pipeline in order to resolve any issues with
the prerequisite data or model outputs.
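
To make the idea of alerting and rerunning concrete, here is a
minimal sketch of a retry wrapper in plain Python. It is not how any
particular workflow tool implements retries; run_with_retries, step,
and alert are hypothetical names, and the alert callback stands in for
whatever notification channel your team uses.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], object],
                     alert: Callable[[str], None],
                     max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> object:
    """Run a pipeline step, alerting on each failure and retrying."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as error:
            alert(f"attempt {attempt} of {max_attempts} failed: {error}")
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            time.sleep(delay_seconds)
```

A step that fails because of a short upstream outage would succeed on
a later attempt, while a persistent failure still raises after the
final alert so that someone can investigate.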

Workflow tools provide a solution for managing these types of
problems in model pipelines. With a workflow tool, you specify
the operations that need to be completed, identify dependencies
between the operations, and then schedule the operations to be
performed by the tool. A workflow tool is responsible for running
tasks, provisioning resources, and monitoring the status of tasks.
There are a number of open source tools for building workflows
including Airflow, Luigi, MLflow, and Pentaho Kettle. We’ll focus
on Airflow, because it is being widely adopted across companies,
and cloud platforms are also providing fully-managed versions
of Airflow.
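
The core idea of specifying operations and their dependencies can be
sketched with the Python standard library. This is only an
illustration of dependency ordering, not Airflow's API; the
run_pipeline function and the task names are hypothetical, and the
graphlib module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks: dict, dependencies: dict) -> list:
    """Run each task after all of the tasks it depends on have run.

    tasks maps a task name to a function with no arguments, and
    dependencies maps a task name to the set of tasks it must wait for.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
tasks = {
    "fetch": lambda: log.append("fetch"),
    "train": lambda: log.append("train"),
    "predict": lambda: log.append("predict"),
    "save": lambda: log.append("save"),
}
dependencies = {"train": {"fetch"}, "predict": {"train"}, "save": {"predict"}}
run_pipeline(tasks, dependencies)
print(log)
```

A workflow tool adds scheduling, resource provisioning, and status
monitoring on top of this ordering step.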

In this chapter, we’ll build a batch model pipeline that runs as a
Docker container. Next, we’ll schedule the task to run on an EC2
instance using cron, and then explore a managed version of cron
using Kubernetes. In the third section, we’ll use Airflow to define a
graph of operations to perform in order to run our model pipeline,
and explore a cloud offering of Airflow.


5.1 Sklearn Workflow

A common workflow for batch model pipelines is to extract data
from a data lake or data warehouse, train a model on historic user
behavior, predict future user behavior for more recent data, and
then save the results to a data warehouse or application database.
In the gaming industry, this is a workflow I’ve seen used for
building likelihood to purchase and likelihood to churn models, where
the game servers use these predictions to provide different
treatments to users based on the model predictions. Usually libraries
like sklearn are used to develop models, and frameworks such as
PySpark are used to scale up to the full player base.

It is typical for model pipelines to require other ETLs to run in a
data platform before the pipeline can run on the most recent data.
For example, there may be an upstream step in the data platform
that translates json strings into schematized events that are used
as input for a model. In this situation, it might be necessary to
rerun the pipeline for a day on which issues occurred with the json
transformation process. For this section, we’ll avoid this
complication by using a static input data source, but the tools that we’ll
explore provide the functionality needed to handle these issues.
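
As a small illustration of what rerunning for specific days involves,
the sketch below lists the dates that a backfill would need to cover.
The function name backfill_dates is hypothetical, and a workflow tool
would normally generate these runs for you.

```python
from datetime import date, timedelta


def backfill_dates(start: date, end: date) -> list:
    """Return every date from start to end, inclusive, in order."""
    if end < start:
        raise ValueError("end must not be before start")
    return [start + timedelta(days=offset)
            for offset in range((end - start).days + 1)]


# rerun the pipeline once for each day affected by an upstream issue
for run_date in backfill_dates(date(2019, 10, 24), date(2019, 10, 26)):
    print(f"rerun pipeline for {run_date.isoformat()}")
```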

There are typically two types of batch model pipelines that I’ve
seen deployed in the gaming industry:

• Persistent: A separate training workflow is used to train models
from the one used to build predictions. A model is persisted
between training runs and loaded in the serving workflow.

• Transient: The same workflow is used for training and serving
predictions, and instead of saving the model as a file, the model
is rebuilt for each run.
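
The difference between the two styles can be sketched with a toy
model. MeanModel below is a hypothetical stand-in for a real
estimator, and its state is stored as JSON only to keep the example
self-contained; a fitted sklearn model would typically be saved with
pickle or joblib instead.

```python
import json
from statistics import mean


class MeanModel:
    """Toy stand-in for an estimator: always predicts the training mean."""

    def fit(self, labels):
        self.mean_ = mean(labels)
        return self

    def predict(self, n_rows):
        return [self.mean_] * n_rows


def transient_run(labels, n_rows):
    """Transient: train and predict in the same run, keeping nothing."""
    return MeanModel().fit(labels).predict(n_rows)


def persistent_train(labels) -> str:
    """Persistent, training workflow: fit and return the saved model state."""
    return json.dumps({"mean": MeanModel().fit(labels).mean_})


def persistent_serve(saved_state: str, n_rows):
    """Persistent, serving workflow: load the saved model and predict."""
    model = MeanModel()
    model.mean_ = json.loads(saved_state)["mean"]
    return model.predict(n_rows)
```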

In this section we’ll build a transient batch pipeline, where a new
model is retrained with each run. This approach generally results
in more compute resources being used if the training process is
heavyweight, but it helps avoid issues with model drift, which is
useful to track. We’ll author a pipeline that performs the following
steps:

1. Fetches a data set from GitHub 

2. Trains a logistic regression model 

3. Applies the regression model 

4. Saves the results to BigQuery 

The pipeline will execute as a single Python script that performs all
of these steps. For situations where you want to use intermediate
outputs from steps across multiple tasks, it’s useful to decompose
the pipeline into multiple processes that are integrated through a
workflow tool such as Airflow.

We’ll build this workflow by first writing a Python script that runs
on an EC2 instance, and then Dockerize the script so that we can
use the container in workflows. To get started, we need to install
a library for writing a Pandas dataframe to BigQuery:


pip install --user pandas_gbq

Next, we’ll create a file called pipeline.py that performs the four
pipeline steps identified above. The script shown below performs
these steps by loading the necessary libraries, fetching the CSV
file from GitHub into a Pandas dataframe, splitting the dataframe
into train and test groups to simulate historic and more recent
users, building a logistic regression model using the training data
set, creating predictions for the test data set, and saving the resulting
dataframe to BigQuery.


import pandas as pd
import numpy as np
from google.oauth2 import service_account
from sklearn.linear_model import LogisticRegression
from datetime import datetime
import pandas_gbq

# fetch the data set and add IDs
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/"
                      "master/Recommendations/games-expand.csv")
gamesDF['User_ID'] = gamesDF.index
gamesDF['New_User'] = np.floor(np.random.randint(0, 10,
                                   gamesDF.shape[0])/9)

# train and test groups
train = gamesDF[gamesDF['New_User'] == 0]
x_train = train.iloc[:,0:10]
y_train = train['label']
test = gamesDF[gamesDF['New_User'] == 1]
x_test = test.iloc[:,0:10]

# build a model
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict_proba(x_test)[:, 1]

# build a predictions dataframe
resultDF = pd.DataFrame({'User_ID':test['User_ID'], 'Pred':y_pred})
resultDF['time'] = str(datetime.now())

# save predictions to BigQuery
table_id = "dsp_demo.user_scores"
project_id = "gameanalytics-123"
credentials = service_account.Credentials.\
                  from_service_account_file('dsdemo.json')
pandas_gbq.to_gbq(resultDF, table_id, project_id=project_id,
                  if_exists = 'replace', credentials=credentials)
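
The New_User assignment in the script is compact, so it may help to
unpack it. A random integer from 0 to 9 divided by 9 and rounded
down is 1 only when the integer is 9, so roughly one in ten rows is
flagged as a new user. The sketch below reproduces that logic with
the standard library instead of NumPy; assign_new_user_flags is a
hypothetical helper and is not part of the pipeline script.

```python
import math
import random


def assign_new_user_flags(n_rows: int, seed=None) -> list:
    """Flag about 10% of rows as new users (1) and the rest as historic (0)."""
    rng = random.Random(seed)
    # randrange(0, 10) draws from 0..9, and floor(value / 9) is 1 only for 9
    return [math.floor(rng.randrange(0, 10) / 9) for _ in range(n_rows)]


flags = assign_new_user_flags(10000, seed=1)
print("share of new users:", sum(flags) / len(flags))
```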

To simulate a real-world data set, the script assigns a User_ID
attribute to each record, which represents a unique ID to track
different users in a system. The script also splits users into historic and
recent groups by assigning a New_User attribute. After building
predictions for each of the recent users, we create a results dataframe
with the user ID, the model prediction, and a timestamp. It’s
useful to apply timestamps to predictions in order to determine if the
pipeline has completed successfully. To test the model pipeline,
run the following statements on the command line:

export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json
python3 pipeline.py


If successful, the script should create a new data set on BigQuery
called dsp_demo, create a new table called user_scores, and fill the
table with user predictions. To test if data was actually populated
in BigQuery, run the following commands in Jupyter:


from google.cloud import bigquery
client = bigquery.Client()

sql = "select * from dsp_demo.user_scores"
client.query(sql).to_dataframe().head()

FIGURE 5.1: Querying the uploaded predictions in BigQuery.


This script will set up a client for connecting to BigQuery and then
display the result set of the query submitted to BigQuery. You can
also browse to the BigQuery web UI to inspect the results of the
pipeline, as shown in Figure 5.1. We now have a script that can
fetch data, apply a machine learning model, and save the results
as a single process.

With many workflow tools, you can run Python code or bash
scripts directly, but it’s good to set up isolated environments for
executing scripts in order to avoid dependency conflicts for
different libraries and runtimes. Luckily, we explored a tool for this in
Chapter 4 and can use Docker with workflow tools. It’s useful to
wrap Python scripts in Docker for workflow tools, because you can
add libraries that may not be installed on the system responsible
for scheduling, you can avoid issues with Python version conflicts,
and containers are becoming a common way of defining tasks in
workflow tools.

To containerize our workflow, we need to define a Dockerfile, as
shown below. Since we are building out a new Python environment
from scratch, we’ll need to install Pandas, sklearn, and the
BigQuery library. We also need to copy credentials from the EC2
instance into the container so that we can run the export
command for authenticating with GCP. This works for short term
deployments, but for longer running containers it’s better to run
the export in the instantiated container rather than copying static
credentials into images. The Dockerfile lists out the Python
libraries needed to run the script, copies in the local files needed for
execution, exports credentials, and specifies the script to run.

FROM ubuntu:latest
MAINTAINER Ben Weber

# scikit-learn is the PyPI package name for the sklearn library
RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && cd /usr/local/bin \
  && ln -s /usr/bin/python3 python \
  && pip3 install pandas \
  && pip3 install scikit-learn \
  && pip3 install pandas_gbq

COPY pipeline.py pipeline.py
COPY /home/ec2-user/dsdemo.json dsdemo.json
RUN export GOOGLE_APPLICATION_CREDENTIALS=/dsdemo.json
ENTRYPOINT ["python3","pipeline.py"]

Before deploying this script to production, we need to build an
image from the script and test a sample run. The commands below
show how to build an image from the Dockerfile, list the Docker
images, and run an instance of the model pipeline image.

sudo docker image build -t "sklearn_pipeline" . 

sudo docker images 

sudo docker run sklearn_pipeline 


After running the last command, the containerized pipeline should
update the model predictions in BigQuery. We now have a model
pipeline that we can run as a single bash command, which we
now need to schedule to run at a specific frequency. For testing
purposes, we’ll run the script every minute, but in practice models
are typically executed hourly, daily, or weekly.


5.2 Cron 

A common requirement for model pipelines is running a task at
a regular frequency, such as every day or every hour. Cron is a
utility that provides scheduling functionality for machines running
the Linux operating system. You can set up a scheduled task using
the crontab utility and assign a cron expression that defines how
frequently to run the command. Cron jobs run directly on the
machine where cron is utilized, and can make use of the runtimes
and libraries installed on the system.

There are a number of challenges with using cron in
production-grade systems, but it’s a great way to get started with scheduling
a small number of tasks and it’s good to learn the cron expression
syntax that is used in many scheduling systems. The main issue
with the cron utility is that it runs on a single machine, and does
not natively integrate with tools such as version control. If your
machine goes down, then you’ll need to recreate your environment
and update your cron table on a new machine.

A cron expression defines how frequently to run a command. It is
a sequence of five fields (minute, hour, day of month, month, and
day of week) that define when to execute for different
time granularities, and it can include wildcards to always run for
certain time periods. A few sample expressions are shown in the
snippet below:

# run every minute
* * * * *

# Run at 10am UTC everyday
0 10 * * *

# Run at 04:15 on Saturday
15 4 * * 6

When getting started with cron, it’s good to use tools 1 to validate 
your expressions. Cron expressions are used in Airflow and many 
other scheduling systems. 
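
To build intuition for how the five fields are read, here is a
deliberately simplified matcher. It only understands the * wildcard
and single integers, which is enough for the sample expressions above;
real cron also supports ranges, lists, and step values, and treats the
two day fields specially when both are restricted. The function names
are hypothetical.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """A field matches if it is the wildcard or equals the value."""
    return field == "*" or int(field) == value


def cron_matches(expression: str, moment: datetime) -> bool:
    """Check whether a simplified cron expression fires at the given minute."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five space-separated fields")
    minute, hour, day_of_month, month, day_of_week = fields
    # cron counts Sunday as 0, while isoweekday() counts Sunday as 7
    cron_weekday = moment.isoweekday() % 7
    return (field_matches(minute, moment.minute)
            and field_matches(hour, moment.hour)
            and field_matches(day_of_month, moment.day)
            and field_matches(month, moment.month)
            and field_matches(day_of_week, cron_weekday))


# October 26, 2019 was a Saturday
print(cron_matches("15 4 * * 6", datetime(2019, 10, 26, 4, 15)))
```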

We can use cron to schedule our model pipeline to run on a
regular frequency. To schedule a command to run, run the following
command on the console:

crontab -e 

This command will open up the cron table file for editing in vi. 
To schedule the pipeline to run every minute, add the following 
commands to the file and save. 

# run every minute
* * * * * sudo docker run sklearn_pipeline


After exiting the editor, the cron table will be updated with the
new command to run. The second part of the cron statement is the
command to run. When defining the command to run, it’s useful
to include full file paths. With Docker, we just need to define the
image to run. To check that the script is actually executing, browse
to the BigQuery UI and check the time column on the user_scores
model output table.
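
The same check can be expressed in code. The sketch below assumes you
have already read a value from the time column, which the pipeline
writes with str(datetime.now()), and tests whether it is recent
enough. The function name is_fresh and the two-minute threshold are
hypothetical choices for a job that runs every minute.

```python
from datetime import datetime, timedelta


def is_fresh(time_value: str, now: datetime, max_age: timedelta) -> bool:
    """Return True if the stored timestamp is no older than max_age."""
    written_at = datetime.fromisoformat(time_value)
    return now - written_at <= max_age


stamp = str(datetime(2019, 10, 26, 15, 35, 30, 123456))
now = datetime(2019, 10, 26, 15, 36, 21)
print(is_fresh(stamp, now, timedelta(minutes=2)))
```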

We now have a utility for scheduling our model pipeline on a
regular schedule. However, if the machine goes down then our pipeline
will fail to execute. To handle this situation, it’s good to explore
cloud offerings with cron scheduling capabilities.


1 https://crontab.guru/

5.2.1 Cloud Cron

Both AWS and GCP provide managed cron options. With AWS
it’s possible to schedule services to run on ECS using cron
expressions, and the GCP Kubernetes Engine provides scheduling
support for containers. In this section, we’ll explore the GKE
option because it is simpler to set up. The first step is to push our
model pipeline image to Container Registry using the following
commands:


cat dsdemo.json | sudo docker login -u _json_key \
    --password-stdin https://us.gcr.io
sudo docker tag sklearn_pipeline \
    us.gcr.io/[gcp_account]/sklearn_pipeline
sudo docker push us.gcr.io/[gcp_account]/sklearn_pipeline

Next, we’ll set up Kubernetes to schedule the image to run on
a managed cluster. Unlike Section 4.3 where we created a
cluster from an image directly, we’ll use Kubernetes control (kubectl)
commands to set up a scheduled container. Perform the following
steps from the GCP console to create a GKE cluster:

1. Search for and select “Kubernetes Engine”

2. Click “Create Cluster” 

3. Select “Your first cluster” 

4. Click “Create” 

To schedule a task on this cluster, we’ll use a YAML file to con- 
figure our task and then use a kubectl command to update the 
cluster with the model pipeline. Perform the following steps from 
the GKE UI to get access to the cluster and run a file editor: 

1. Select the new cluster 

2. Click on “Connect” 

3. Select “Run in Cloud Shell” 

4. Run the generated gcloud command 

5. Run vi sklearn.yaml

This will provide terminal access to the console and allow us to
save a YAML file on the cluster. The YAML file below shows
how to define a task to run on a schedule with the specified cron
expression.


apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sklearn
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sklearn
            image: us.gcr.io/[gcp_account]/sklearn_pipeline
          restartPolicy: OnFailure


After saving the file, we can use kubectl to update the cluster with
the YAML file. Run the command below to update the cluster
with the model pipeline task:

bgweber@cloudshell:~ (dsp)$ kubectl apply -f sklearn.yaml
cronjob.batch/sklearn created


To view the task, click on “workloads”. The events tab provides a 
log of the runs of the task, as shown in Figure 5.2. Again, we can 
validate that the pipeline is running successfully by browsing to 
the BigQuery UI and checking the time attribute of predictions. 
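
If you need several scheduled tasks, it can help to generate the
manifest rather than edit it by hand. The sketch below builds the same
CronJob definition as a Python dictionary and prints it as JSON, which
kubectl also accepts. The helper name cron_job_manifest is an
assumption made for illustration, and the image path keeps the
[gcp_account] placeholder from the listing above.

```python
import json


def cron_job_manifest(name, image, schedule="* * * * *"):
    """Build a CronJob manifest as a plain dictionary."""
    return {
        "apiVersion": "batch/v1beta1",  # API version used in this chapter
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {"spec": {"template": {"spec": {
                "containers": [{"name": name, "image": image}],
                "restartPolicy": "OnFailure",
            }}}},
        },
    }


manifest = cron_job_manifest(
    "sklearn", "us.gcr.io/[gcp_account]/sklearn_pipeline")
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file such as sklearn.json and passing
that file to kubectl apply -f should behave like the YAML version,
although we have not tested this on every cluster version.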

There are a variety of scheduling options for cloud platforms, and
we explored just one option with GKE. GCP also provides a service
called Cloud Scheduler, but this system does not work directly
with Container Registry. Kubernetes is a good approach for
handling scheduling in a cloud deployment, because the system is
also responsible for provisioning machines that the scheduled task
will use to execute.


FIGURE 5.2: Execution events for the pipeline task on GKE.



5.3 Workflow Tools

Cron is useful for simple pipelines, but runs into challenges when
tasks have dependencies on other tasks which can fail. To help
resolve this issue, where tasks have dependencies and only portions
of a pipeline need to be rerun, we can leverage workflow tools.
Apache Airflow is currently the most popular tool, but other open
source projects are available and provide similar functionality,
including Luigi and MLflow.

There are a few situations where workflow tools provide benefits
over using cron directly:

• Dependencies: Workflow tools define graphs of operations,
which makes dependencies explicit.

• Backfills: It may be necessary to run an ETL on old data, for
a range of different dates.

• Versioning: Most workflow tools integrate with version control
systems to manage graphs.

• Alerting: These tools can send out emails or generate PagerDuty
alerts when failures occur.

Workflow tools are particularly useful in environments where
different teams are scheduling tasks. For example, many game
companies have data scientists that schedule model pipelines which
are dependent on ETLs scheduled by a separate engineering team.
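
To illustrate the backfill case from the list above, the sketch below
enumerates the dates a backfill would need to cover. This is plain
Python rather than a workflow tool feature, and the function name
backfill_dates is hypothetical. A workflow tool performs this
bookkeeping for you and tracks which dates have already succeeded.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Return every date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


for run_date in backfill_dates(date(2019, 11, 1), date(2019, 11, 3)):
    # A real backfill would launch the ETL for this date here.
    print(run_date.isoformat())
```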

In this section, we’ll schedule our task to run on an EC2 instance
using hosted Airflow, and then explore a fully-managed version of
Airflow on GCP.

5.3.1 Apache Airflow

Airflow is an open source workflow tool that was originally
developed by Airbnb and publicly released in 2015. It helps solve a
challenge that many companies face, which is scheduling tasks that
have many dependencies. One of the core concepts in this tool is
a graph that defines the tasks to perform and the relationships
between these tasks.

In Airflow, a graph is referred to as a DAG, which is an acronym 
for directed acyclic graph. A DAG is a set of tasks to perform, 
where each task has zero or more upstream dependencies. One of 
the constraints is that cycles are not allowed, where two tasks have 
upstream dependencies on each other. 
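
The acyclic constraint can be demonstrated without Airflow. The sketch
below uses the graphlib module from the Python standard library
(Python 3.9 or later) to order a small graph of tasks and to show what
happens when a cycle is introduced. The task names are hypothetical,
and this is not how Airflow itself validates a DAG.

```python
from graphlib import TopologicalSorter, CycleError

# Map each task to the set of tasks it depends on (its upstream tasks).
tasks = {
    "sklearn_etl": set(),
    "sklearn_pipeline": {"sklearn_etl"},
    "send_report": {"sklearn_pipeline"},
}
# Upstream tasks always appear before the tasks that depend on them.
print(list(TopologicalSorter(tasks).static_order()))

# Making the first task depend on the last one creates a cycle.
tasks["sklearn_etl"] = {"send_report"}
try:
    list(TopologicalSorter(tasks).static_order())
except CycleError:
    print("cycle detected: not a valid DAG")
```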

DAGs are set up using Python code, which is one of the differences
from other workflow tools such as Pentaho Kettle, which is GUI
focused. The Airflow approach is called “configuration as code”,
because a Python script defines the operations to perform within
a workflow graph. Using code instead of a GUI to configure
workflows is useful because it makes it much easier to integrate with
version control tools such as GitHub.

To get started with Airflow, we need to install the library, initialize
the service, and run the scheduler. To perform these steps, run the
following commands on an EC2 instance or your local machine:


export AIRFLOW_HOME=~/airflow
pip install --user apache-airflow
airflow initdb
airflow scheduler


FIGURE 5.3: The Airflow web app running on an EC2 instance.


Airflow also provides a web frontend for managing DAGs that have
been scheduled. To start this service, run the following command
in a new terminal on the same machine.

airflow webserver -p 8080

This command tells Airflow to start the web service on port 8080.
You can open a web browser at this port on your machine to view
the web frontend for Airflow, as shown in Figure 5.3.

Airflow comes preloaded with a number of example DAGs. For our
model pipeline we’ll create a new DAG and then notify Airflow of
the update. We’ll create a file called sklearn.py with the following
DAG definition:


from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'email': 'bgweber@gmail.com',
    'start_date': datetime(2019, 11, 1),
    'email_on_failure': True,
}

dag = DAG('games', default_args=default_args,
          schedule_interval="* * * * *")

t1 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)


There are a few steps in this Python script to call out. The script
uses a Bash operator to define the action to perform. The Bash
operator is defined as the last step in the script, which specifies
the operation to perform. The DAG is instantiated with a number
of input arguments that define the workflow settings, such as who
to email when the task fails. A cron expression is passed to the
DAG object to define the schedule for the task, and the DAG
object is passed to the Bash operator to associate the task with
this graph of operations.
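
The default arguments are passed on to each operator in the DAG, so
they are a convenient place for retry settings. The sketch below
extends a subset of the dictionary with retries and retry_delay, which
are standard operator arguments in Airflow. The values chosen here are
illustrative assumptions rather than settings from the pipeline, and
the snippet deliberately avoids importing Airflow so that it runs
anywhere.

```python
from datetime import datetime, timedelta

# A subset of the keys from the DAG definition, plus two retry settings.
# The retry values are illustrative assumptions.
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 11, 1),
    'email_on_failure': True,
    'retries': 2,                         # rerun a failed task up to twice
    'retry_delay': timedelta(minutes=5),  # wait between attempts
}

print(default_args['retries'], default_args['retry_delay'])
```

With settings like these, a transient failure such as a dropped
network connection would be retried before an email alert is sent.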

Before adding the DAG to Airflow, it’s useful to check for syntax
errors in your code. We can run the following command from the
terminal to check for issues with the DAG:

python3 sklearn.py 

This command will not run the DAG, but will flag any syntax 
errors present in the script. To update Airflow with the new DAG 
file, run the following command: 


airflow list_dags


FIGURE 5.4: The sklearn DAG scheduled on Airflow.


This command will add the DAG to the list of workflows in Airflow.
To view the list of DAGs, navigate to the Airflow web server, as
shown in Figure 5.4. The web server will show the schedule of the
DAG, and provide a history of past runs of the workflow. To check
that the DAG is actually working, browse to the BigQuery UI and
check for fresh model outputs.

We now have an Airflow service up and running that we can use
to monitor the execution of our workflows. This setup enables us
to track the execution of workflows, backfill any gaps in data sets,
and enable alerting for critical workflows.

Airflow supports a variety of operations, and many companies
author custom operators for internal usage. In our first DAG, we used
the Bash operator to define the task to execute, but other options
are available for running Docker images, including the Docker
operator. The code snippet below shows how to change our DAG to
use the Docker operator instead of the Bash operator.


from airflow.operators.docker_operator import DockerOperator

t1 = DockerOperator(
    task_id='sklearn_pipeline',
    image='sklearn_pipeline',
    dag=dag)


The DAG we defined does not have any dependencies, since the
container performs all of the steps in the model pipeline. If we
had a dependency, such as running a sklearn_etl container before
running the model pipeline, we can use the set_upstream command
as shown below. This configuration sets up two tasks, where the
pipeline task will execute after the etl task completes.

t1 = BashOperator(
    task_id='sklearn_etl',
    bash_command='sudo docker run sklearn_etl',
    dag=dag)

t2 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)

t2.set_upstream(t1)


Airflow provides a rich set of functionality and we’ve only scratched
the surface of what the tool provides. While we were already able
to schedule the model pipeline with hosted and managed cloud
offerings, it’s useful to schedule the task through Airflow for
improved monitoring and versioning. The landscape of workflow tools
will change over time, but many of the concepts of Airflow will
translate to these new tools.

5.3.2 Managed Airflow 

We now have a workflow tool set up for managing model workflows,
but the default configuration does not provide high availability,
where new machines are provisioned when failures occur. While it’s
possible to set up a distributed version of Airflow using Celery to
set up different scheduler and worker nodes, one of the recent trends
is using Kubernetes to create more robust Airflow deployments.

It is possible to self-host Airflow on Kubernetes, but it can be 
complex to set up. There are also fully-managed versions of Airflow 
available for cloud platforms such as Cloud Composer on GCP. 
With a managed version of Airflow, you define the DAGs to execute 
and set the schedules, and the platform is responsible for providing 
a high-availability deployment. 

To run our DAG on Cloud Composer, we’ll need to update the
task to use a GKE Pod operator in place of a Docker operator,
because Composer needs to be able to authenticate with Container
Registry. The updated DAG is shown in the snippet below.


from airflow.gcp.operators.kubernetes_engine import GKEPodOperator

t1 = GKEPodOperator(
    task_id='sklearn_pipeline',
    project_id='{your_project_id}',
    cluster_name='us-central1-models-13d59d5b-gke',
    name='sklearn_pipeline',
    namespace='default',
    location='us-central1-c',
    image='us.gcr.io/{your_project_id}/sklearn_pipeline',
    dag=dag)


Cloud Composer is built on top of GKE. A beta version was
released in 2018 and many features of the tool are still evolving.
To get started with Composer, perform the following steps in the
GCP Console:

1. Browse to Cloud Composer
   (https://console.cloud.google.com/composer/)

2. Click “Create”

3. Assign a name, “models”

4. Select Python 3

5. Select the most recent image version

6. Click “Create”

Like GKE, cluster setup takes quite a while to complete. Once
setup completes, you can access the Airflow web service by clicking
on “Airflow” in the list of clusters, as shown in Figure 5.5.

FIGURE 5.5: Fully-managed Airflow on Cloud Composer.

Before adding our DAG, we’ll need to update the cluster_name
attribute to point to the provisioned GKE cluster. To find the
cluster name, click on the Composer cluster and find the GKE cluster
attribute. To add the DAG, click on “DAGs” in the cluster list,
which will direct your browser to a Google Cloud Storage bucket.
After uploading the DAG to this bucket using the upload UI, the
Airflow cluster will automatically detect the new DAG and add it
to the list of workflows managed by this cluster. We now have a
high-availability Airflow deployment running our model workflow.


5.4 Conclusion 

In this chapter we explored a batch model pipeline for applying a
machine learning model to a set of users and storing the results to
BigQuery. To make the pipeline portable, so that we can execute
it in different environments, we created a Docker image to define
the required libraries and credentials for the pipeline. We then
ran the pipeline on an EC2 instance using batch commands, cron,
and Airflow. We also used GKE and Cloud Composer to run the
container via Kubernetes.

Workflow tools can be tedious to set up, especially when installing
a cluster deployment, but they provide a number of benefits over
manual approaches. One of the key benefits is the ability to
handle DAG configuration as code, which enables code reviews and
version control for workflows. It’s useful to get experience with
configuration as code, because more and more projects are using
this approach.


6 PySpark for Batch Pipelines


Spark is a general-purpose computing framework that can scale
to massive data volumes. It builds upon prior big data tools such
as Hadoop and MapReduce, while providing significant
improvements in the expressivity of the languages it supports. One of the
core components of Spark is resilient distributed datasets (RDD),
which enable clusters of machines to perform workloads in a
coordinated and fault-tolerant manner. In more recent versions of Spark,
the Dataframe API provides an abstraction on top of RDDs that
resembles the same data structure in R and Pandas. PySpark is
the Python interface to Spark, and it provides an API for working
with large-scale datasets in a distributed computing environment.

PySpark is an extremely valuable tool for data scientists, because
it can streamline the process for translating prototype models into
production-grade model workflows. At Zynga, our data science
team owns a number of production-grade systems that provide
useful signals to our game and marketing teams. By using PySpark,
we’ve been able to reduce the amount of support we need
from engineering teams to scale up models from concept to
production.

Up until now in this book, all of the models we’ve built and
deployed have been targeted at single machines. While we are able
to scale up model serving to multiple machines using Lambda,
ECS, and GKE, these containers worked in isolation and there was
no coordination among nodes in these environments. With PySpark,
we can build model workflows that are designed to operate
in cluster environments for both model training and model serving.
The result is that data scientists can now tackle much larger-scale
problems than previously possible using prior Python tools.
PySpark provides a nice tradeoff between an expressive programming
language and APIs to Spark versus more legacy options such as
MapReduce. A general trend is that the use of Hadoop is dropping
as more data science and engineering teams are switching to
Spark ecosystems. In Chapter 7 we’ll explore another distributed
computing ecosystem for data science called Cloud Dataflow, but
for now Spark is the open-source leader in this space. PySpark was
one of the main motivations for me to switch from R to Python
for data science workflows.

The goal of this chapter is to provide an introduction to PySpark
for Python programmers that shows how to build large-scale model
pipelines for batch scoring applications, where you may have
billions of records and millions of users that need to be scored. While
production-grade systems will typically push results to application
databases, in this chapter we’ll focus on batch processes that pull
in data from a lake and push results back to the data lake for other
systems to use. We’ll explore pipelines that perform model
applications for both AWS and GCP. While the data sets used in this
chapter rely on AWS and GCP for storage, the Spark environment
does not have to run on either of these platforms and instead can
run on Azure, other clouds, or on-prem Spark clusters.

We’ll cover a variety of different topics in this chapter to show 
different use cases of PySpark for scalable model pipelines. After 
showing how to make data available to Spark on S3, we’ll cover
some of the basics of PySpark focusing on dataframe operations.
Next, we’ll build out a predictive model pipeline that reads in 
data from S3, performs batch model predictions, and then writes 
the results to S3. We’ll follow this by showing off how a newer 
feature called Pandas UDFs can be used with PySpark to perform 
distributed deep learning and feature engineering. To conclude, 
we’ll build another batch model pipeline now using GCP and then 
discuss how to productize workflows in a Spark ecosystem. 


6.1 Spark Environments

There are a variety of ways to both configure Spark clusters and
submit commands to a cluster for execution. When getting started
with PySpark as a data scientist, my recommendation is to use a
freely-available notebook environment for getting up and running
with Spark as quickly as possible. While PySpark may not perform
quite as well as Java or Scala for large-scale workflows, the ease of
development in an interactive programming environment is worth
the trade-off.

Based on your organization, you may be starting from scratch for 
Spark or using an existing solution. Here are the types of Spark 
deployments I’ve seen in practice:

• Self Hosted: An engineering team manages a set of clusters and 
provides console and notebook access. 

• Cloud Solutions: AWS provides a managed Spark option called 
EMR and GCP has Cloud Dataproc. 

• Vendor Solutions: Databricks, Cloudera, and other vendors 
provide fully-managed Spark environments. 

There are a number of different factors to consider when choosing a
Spark ecosystem, including cost, scalability, and feature sets. As
you scale the size of the team using Spark, additional
considerations are whether an ecosystem supports multi-tenancy, where
multiple jobs can run concurrently on the same cluster, and
isolation, where one job failing should not kill other jobs. Self-hosted
solutions require significant engineering work to support these
additional considerations, so many organizations use cloud or vendor
solutions for Spark. In this book, we’ll use the Databricks
Community Edition, which provides all of the baseline features needed
for learning Spark in a collaborative notebook environment.

Spark is a rapidly evolving ecosystem, and it’s difficult to author
books about this subject that do not quickly become out of date
as the platform evolves. Another issue is that many books target
Scala rather than Python for the majority of coding examples. My
advice for readers that want to dig deeper into the Spark ecosystem
is to explore books based on the broader Spark ecosystem, such
as (Karau et al., 2015). You’ll likely need to read through Scala or
Java code examples, but the majority of content covered will be
relevant to PySpark.

6.1.1 Spark Clusters 

A Spark environment is a cluster of machines with a single driver
node and zero or more worker nodes. The driver machine is the
master node in the cluster and is responsible for coordinating the
workloads to perform. In general, workloads will be distributed
across the worker nodes when performing operations on Spark
dataframes. However, when working with Python objects, such
as lists or dictionaries, objects will be instantiated on the driver
node.

Ideally, you want all of your workloads to be operating on worker
nodes, so that the execution of the steps to perform is distributed
across the cluster, and not bottlenecked by the driver node.
However, there are some types of operations in PySpark where the
driver has to perform all of the work. The most common situation
where this happens is when using Pandas dataframes in your
workloads. When you use toPandas or other commands to convert a
data set to a Pandas object, all of the data is loaded into memory
on the driver node, which can crash the driver node when working
with large data sets.

In PySpark, the majority of commands are lazily executed, meaning
that an operation is not performed until an output is explicitly
needed. For example, a join operation between two Spark
dataframes will not immediately cause the join to be performed,
unlike in Pandas, where the join runs as soon as it is called. Instead,
the join is performed once an output is added to the chain of
operations to perform, such as displaying a sample of the resulting
dataframe. One of the key differences from Pandas, where
operations are eagerly performed and pulled into memory, is that
PySpark operations are lazily performed and not pulled into memory
until needed. One of the benefits of this approach is that the graph
of operations to perform can be optimized before being sent to the
cluster to execute.
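
Lazy execution can be illustrated with ordinary Python generators.
The sketch below is an analogy only and does not use PySpark: it chains
two transformations over a source that records each read, and shows
that nothing is read until an output is requested. The helper name
traced is hypothetical.

```python
def traced(name, rows, log):
    """Yield rows unchanged, recording when each one is actually read."""
    for row in rows:
        log.append(name)
        yield row


log = []
rows = traced("read", range(5), log)

# Generator expressions describe the work without performing it.
doubled = (r * 2 for r in rows)
filtered = (r for r in doubled if r > 2)
print(len(log))  # no rows have been read yet

# Requesting an output triggers the whole chain, one row at a time.
first = next(filtered)
print(first, len(log))
```

Spark goes further than this analogy, because it can inspect and
reorder the planned operations before running them, but the core idea
of deferring work until an output is needed is the same.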

In general, nodes in a Spark cluster should be considered 
ephemeral, because a cluster can be resized during execution. Addi- 
tionally, some vendors may spin up a new cluster when scheduling 
a job to run. This means that common operations in Python, such 
as saving files to disk, do not map directly to PySpark. Instead, 
using a distributed computing environment means that you need 
to use a persistent file store such as S3 when saving data. This 
is important for logging, because a worker node may crash and 
it may not be possible to ssh into the node for debugging. Most 
Spark deployments have a logging system set up to help with this
issue, but it’s good practice to log workflow status to persistent 
storage. 

6.1.2 Databricks Community Edition 

One of the quickest ways to get up and running with PySpark is 
to use a hosted notebook environment. Databricks is the largest 
Spark vendor and provides a free version for getting started called 
Community Edition 1. We’ll use this environment to get started
with Spark and build AWS and GCP model pipelines. 

The first step is to create a login on the Databricks website for the 
community edition. Next, perform the following steps to spin up 
a test cluster after logging in: 

1. Click on “Clusters” on the left navigation bar 

2. Click “Create Cluster” 

3. Assign a name, “DSP” 

4. Select the most recent runtime (non-beta) 

5. Click “Create Cluster” 

1 https://community.cloud.databricks.com/

After a few minutes we’ll have a cluster set up that we can use
for submitting Spark commands. Before attaching a notebook to
the cluster, we’ll first set up the libraries that we’ll use throughout
this chapter. Instead of using pip to install libraries, we’ll use the
Databricks UI, which makes sure that every node in the cluster
has the same set of libraries installed. We’ll use both Maven and
PyPI to install libraries on the cluster. To install the BigQuery
connector, perform the following steps:

1. Click on “Clusters” on the left navigation bar

2. Select the “DSP” cluster 

3. Click on the “Libraries” tab 

4. Select “Install New” 

5. Click on the “Maven” tab

6. Set coordinates to com.spotify:spark-bigquery_2.11:0.2.2

7. Click “Install”

The UI will then show the status as resolving, then installing,
and then installed. We also need to attach a few Python libraries
that are not pre-installed on a new Databricks cluster. Standard
libraries such as Pandas are installed, but you might need to
upgrade to a more recent version since the libraries pre-installed by
Databricks can lag significantly.

To install a Python library on Databricks, perform the same steps
as before through step 4. Next, instead of selecting “Maven” choose
“PyPI”. Under Package, specify the package you want to install and
then click “Install”. To follow along with all of the sections in this
chapter, you’ll need to install the following Python packages:

• koalas - for dataframe conversion. 

• featuretools - for feature generation. 

• tensorflow - for a deep learning backend. 

• keras - for a deep learning model. 

You’ll now have a cluster set up capable of performing distributed 
feature engineering and deep learning. We’ll start with basic Spark 
commands, show off newer functionality such as the Koalas library, 
and then dig into these more advanced topics. After this setup, 
your cluster library setup should look like Figure 6.1. To ensure
that everything is set up successfully, restart the cluster and check
the status of the installed libraries.

FIGURE 6.1: Libraries attached to a Databricks cluster.


Now that we have provisioned a cluster and set up the required 
libraries, we can create a notebook to start submitting commands 
to the cluster. To create a new notebook, perform the following 
steps: 

1. Click on “Databricks” on the left navigation bar 

2. Under “Common Tasks”, select “New Notebook” 

3. Assign a name, “CH6”

4. Select “Python” as the language

5. Select “DSP” as the cluster

6. Click “Create” 

The result will be a notebook environment where you can start
running Python and PySpark commands, such as print("Hello
World!"). An example notebook running this command is shown
in Figure 6.2. We now have a PySpark environment up and
running that we can use to build distributed model pipelines.

FIGURE 6.2: Running a Python command in Databricks.


6.2 Staging Data 

Data is essential for PySpark workflows. Spark supports a variety
of methods for reading in data sets, including connecting to data
lakes and data warehouses, as well as loading sample data sets from
libraries, such as the Boston housing data set. Since the theme of
this book is building scalable pipelines, we’ll focus on using data
layers that work with distributed workflows. To get started with
PySpark, we’ll stage input data for a model pipeline on S3, and
then read in the data set as a Spark dataframe.

This section will show how to stage data to S3, set up credentials
for accessing the data from Spark, and fetch the data from S3
into a Spark dataframe. The first step is to set up a bucket on S3
for storing the data set we want to load. To perform this step, run
the following operations on the command line.


aws s3api create-bucket --bucket dsp-ch6 --region us-east-1
aws s3 ls


After running the command to create a new bucket, we use the ls
command to verify that the bucket has been successfully created.
Next, we’ll download the games data set to the EC2 instance and
then move the file to S3 using the cp command, as shown in the
snippet below.

wget https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv
aws s3 cp games-expand.csv s3://dsp-ch6/csv/games-expand.csv


In addition to staging the games data set to S3, we’ll also copy a
subset of the CSV files from the Kaggle NHL data set, which we
set up in Section 1.5.2. Run the following commands to stage the
plays and stats CSV files from the NHL data set to S3.


aws s3 cp game_plays.csv s3://dsp-ch6/csv/game_plays.csv
aws s3 cp game_skater_stats.csv \
    s3://dsp-ch6/csv/game_skater_stats.csv
aws s3 ls s3://dsp-ch6/csv/


We now have all of the data sets needed for the code examples
in this chapter. In order to read in these data sets from Spark, 
we’ll need to set up S3 credentials for interacting with S3 from the 
Spark cluster. 

6.2.1 S3 Credentials 

For production environments, it is better to use IAM roles to man- 
age access instead of using access keys. However, the community 
edition of Databricks constrains how much configuration is allowed,
so we’ll use access keys to get up and running with the examples 
in this chapter. We already set up a user for accessing S3 from an 
EC2 instance. To create a set of credentials for accessing S3 pro- 
grammatically, perform the following steps from the AWS console: 

1. Search for and select “IAM” 

2. Click on “Users” 

3. Select the user created in Section 3.3.2, “S3_Lambda” 

4. Click “Security Credentials” 

5. Click “Create Access Key” 

The result will be an access key and a secret key enabling access
to S3. Save these values in a secure location, as we’ll use them in
the notebook to connect to the data sets on S3. Once you are done
with this chapter, it is recommended to revoke these credentials.
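
One way to avoid pasting keys into shared code is to read them from
environment variables. The sketch below is a minimal example that
assumes the keys are exposed under AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY, the names used by the AWS command line tools.
Whether you can set these variables depends on your environment, and
the helper name load_aws_keys is hypothetical.

```python
import os


def load_aws_keys(env=os.environ):
    """Read the S3 access key pair from environment variables."""
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(
            f"missing credential variable: {missing}") from None


# Demonstration with placeholder values instead of real credentials.
access_key, secret_key = load_aws_keys(
    {"AWS_ACCESS_KEY_ID": "AK...", "AWS_SECRET_ACCESS_KEY": "dC..."})
print(access_key)
```

Failing early with a clear error is useful here, because a missing key
otherwise surfaces later as a less obvious S3 permission failure.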

Now that we have credentials set up for access, we can return to
the Databricks notebook to read in the data set. To enable access
to S3, we need to set the access key and secret key in the Hadoop
configuration of the cluster. To set these keys, run the PySpark
commands shown in the snippet below. You’ll need to replace the
access and secret keys with the credentials we just created for the
S3_Lambda user.

AWS_ACCESS_KEY = "AK..."
AWS_SECRET_KEY = "dC..."

sc._jsc.hadoopConfiguration().set(
    "fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set(
    "fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)

We can now read the data set into a Spark dataframe using the read
command, as shown below. This command uses the Spark session
to issue a read command and reads in the data set using the CSV
input reader. We also specify that the CSV file includes a header
row and that we want Spark to infer the data types for the columns.
When reading in CSV files, Spark eagerly fetches the data set into
memory, which can cause issues for larger data sets. When working
with large CSV files, it’s a best practice to split up large data sets
into multiple files and then read in the files using a wildcard in
the input path. When using other file formats, such as Parquet or
Avro, Spark lazily fetches the data sets.


games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
                          header=True, inferSchema=True)

display(games_df)


The display command in the snippet above is a utility function
provided by Databricks that samples the input dataframe and shows
a table representation of the frame, as shown in Figure 6.3. It is
similar to the head function in Pandas, but provides additional
functionality such as transforming the sampled dataframe into a
plot. We’ll explore the plotting functionality in Section 6.3.3.

[Figure: Databricks table view of games_df with the columns G1
through G10 and label, showing the first 1000 rows.]

FIGURE 6.3: Displaying the dataframe in Databricks.
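The earlier advice about splitting a large CSV data set into multiple files and reading them with a wildcard can be sketched as a small helper. This is a minimal sketch rather than part of the chapter’s pipeline: the function name read_csv_parts and the path in the usage comment are hypothetical, and the Spark session is passed in as a parameter. Supplying an explicit schema is optional, and it avoids the extra pass over the data that schema inference needs.

```python
def read_csv_parts(spark, path_glob, schema=None):
    """Read many CSV part files matched by a wildcard into one dataframe."""
    if schema is not None:
        # An explicit schema (a StructType or a DDL string such as
        # "G1 INT, G2 INT") skips the pass needed to infer column types.
        return spark.read.csv(path_glob, header=True, schema=schema)
    return spark.read.csv(path_glob, header=True, inferSchema=True)


# Hypothetical usage in a notebook where `spark` is already defined:
# games_df = read_csv_parts(spark, "s3://dsp-ch6/csv/games-parts/*.csv")
```

The wildcard lets Spark assign the part files to different workers instead of parsing one large file.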

Now that we have data loaded into a Spark dataframe, we can begin
exploring the PySpark language, which enables data scientists
to build production-grade model pipelines.


6.3 A PySpark Primer 

PySpark is a powerful language for both exploratory analysis and
building machine learning pipelines. The core data type in PySpark
is the Spark dataframe, which is similar to Pandas dataframes,
but is designed to execute in a distributed environment. While the
Spark Dataframe API does provide a familiar interface for Python
programmers, there are significant differences in the way that
commands issued to these objects are executed. A key difference is
that Spark commands are lazily executed, which means that commands
such as iloc are not available on these objects. While working with
Spark dataframes can seem constraining, the benefit is that PySpark
can scale to much larger data sets than Pandas.

This section will walk through common operations for Spark
dataframes, including persisting data, converting between different
dataframe types, transforming dataframes, and using user-defined
functions. We’ll use the NHL stats data set, which provides
user-level summaries of player performance for each game. To load
this data set as a Spark dataframe, run the commands in the snippet
below.


stats_df = spark.read.csv("s3://dsp-ch6/csv/game_skater_stats.csv",
    header=True, inferSchema=True)
display(stats_df)


6.3.1 Persisting Dataframes 

A common operation in PySpark is saving a dataframe to persistent
storage, or reading in a data set from a storage layer. While
PySpark can work with databases such as Redshift, it performs
much better when using distributed file stores such as S3 or GCS.
In this chapter we’ll use these types of storage layers as the outputs
of model pipelines, but it’s also useful to stage data to S3 as
intermediate steps within a workflow. For example, in the
AutoModel [2] system at Zynga, we stage the output of the feature
generation step to S3 before using MLlib to train and apply machine
learning models.

The data storage layer to use will depend on your cloud platform.
For AWS, S3 works well with Spark for distributed data reads and
writes. When using S3 or other data lakes, Spark supports a variety
of different file formats for persisting data. Parquet is typically the
industry standard when working with Spark, but we’ll also explore
Avro and ORC in addition to CSV. Avro is a better format for
streaming data pipelines, and ORC is useful when working with
legacy data pipelines.

[2] https://www.gamasutra.com/blogs/BenWeber/20190426/340293/

To show the range of data formats supported by Spark, we’ll take
the stats data set and write it to Avro, then Parquet, then ORC,
and finally CSV. After performing this round trip of data IO, we’ll
end up with our initial Spark dataframe. To start, we’ll save the
stats dataframe in Avro format, using the code snippet shown below.
This code writes the dataframe to S3 in Avro format using
the Databricks Avro writer, and then reads in the results using
the same library. The result of performing these steps is that we
now have a Spark dataframe pointing to the Avro files on S3. Since
PySpark lazily evaluates operations, the Avro files are not pulled
to the Spark cluster until an output needs to be created from this
data set.


# AVRO write
avro_path = "s3://dsp-ch6/avro/game_skater_stats/"
stats_df.write.mode('overwrite').format(
    "com.databricks.spark.avro").save(avro_path)

# AVRO read
avro_df = sqlContext.read.format(
    "com.databricks.spark.avro").load(avro_path)


Avro is a distributed file format that is record based, while the
Parquet and ORC formats are column based. Avro is useful for the
streaming workflows that we’ll explore in Chapter 8, because it
compresses records for distributed data processing. The output of
saving the stats dataframe in Avro format is shown in the snippet
below, which shows a subset of the status files and data files
generated when persisting a dataframe to S3 as Avro. Like most
scalable data formats, Avro will write records to several files, based
on partitions if specified, in order to enable efficient read and write
operations.


aws s3 ls s3://dsp-ch6/avro/game_skater_stats/
2019-11-27 23:02:43    1455 _committed_1588617578250853157
2019-11-27 22:36:31    1455 _committed_1600779730937880795
2019-11-27 23:02:40       0 _started_1588617578250853157
2019-11-27 23:31:42       0 _started_6942074136190838586
2019-11-27 23:31:47 1486327 part-00000-tid-6942074136190838586-
    c6806d0e-9e3d-40fc-b212-61c3d45c1bc3-15-1-c000.avro
2019-11-27 23:31:43   44514 part-00007-tid-6942074136190838586-
    c6806d0e-9e3d-40fc-b212-61c3d45c1bc3-22-1-c000.avro


Parquet on S3 is currently the standard approach for building
data lakes on AWS, and tools such as Delta Lake are leveraging
this format to provide highly-scalable data platforms. Parquet is
a columnar-oriented file format that is designed for efficient reads
when only a subset of columns are being accessed for an operation,
such as when using Spark SQL. Parquet is a native format for
Spark, which means that PySpark has built-in functions for both
reading and writing files in this format.

An example of writing the stats dataframe as Parquet files and
reading in the result as a new dataframe is shown in the snippet
below. In this example, we haven’t set a partition key, but as with
Avro, the dataframe will be split up into multiple files in order
to support highly-performant read and write operations. When
working with large-scale data sets, it’s useful to set partition keys
for the file export using the repartition function. After this section,
we’ll use Parquet as the primary file format when working with
Spark.

# parquet out
parquet_path = "s3a://dsp-ch6/games-parquet/"
avro_df.write.mode('overwrite').parquet(parquet_path)

# parquet in
parquet_df = sqlContext.read.parquet(parquet_path)
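To make the partition key suggestion concrete, the sketch below repartitions on a column before writing Parquet. The helper name, the output path, and the choice of team_id as the key are assumptions for illustration. The partitionBy call is an addition that goes beyond the repartition function mentioned in the text: it lays the output out as one directory per key value.

```python
def write_parquet_by_key(df, path, key):
    """Write a dataframe as Parquet, grouped and laid out by a key column."""
    (df.repartition(key)        # co-locate rows that share the same key
       .write.mode("overwrite")
       .partitionBy(key)        # one output directory per key value
       .parquet(path))


# Hypothetical usage with the stats dataframe:
# write_parquet_by_key(stats_df, "s3a://dsp-ch6/stats-by-team/", "team_id")
```

A key with a modest number of distinct values tends to work best here, since every distinct value becomes its own directory.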


ORC is another columnar format that works well with Spark.
The main benefit over Parquet is that it can support improved
compression, at the cost of additional compute. I’m including
it in this chapter because some legacy systems still use this format.
An example of writing the stats dataframe to ORC and reading the
results back into a Spark dataframe is shown in the snippet below.
Like the Avro format, the ORC write command will distribute the
dataframe to multiple files based on the size.

# orc out
orc_path = "s3a://dsp-ch6/games-orc/"
parquet_df.write.mode('overwrite').orc(orc_path)

# orc in
orc_df = sqlContext.read.orc(orc_path)


To complete our round trip of file formats, we’ll write the results
back to S3 in the CSV format. To make sure that we write a single
file rather than a batch of files, we’ll use the coalesce command
to collect the data into a single partition on one node before
exporting it. This command can fail with large data sets, and in
general it’s best to avoid using the CSV format when using Spark.
However, CSV files are still a common format for sharing data, so
it’s useful to understand how to export to this format.

# CSV out
csv_path = "s3a://dsp-ch6/games-csv-out/"
orc_df.coalesce(1).write.mode('overwrite').format(
    "com.databricks.spark.csv").option("header", "true").save(csv_path)

# and CSV read to finish the round trip
csv_df = spark.read.csv(csv_path, header=True, inferSchema=True)


The resulting dataframe is the same as the dataframe that we first
read in from S3, but if the data types are not trivial to infer, then
the CSV format can cause problems. When persisting data with
PySpark, it’s best to use file formats that describe the schema of
the data being persisted.

6.3.2 Converting Dataframes 

While it’s best to work with Spark dataframes when authoring
PySpark workloads, it’s often necessary to translate between
different formats based on your use case. For example, you might
need to perform a Pandas operation, such as selecting a specific
element from a dataframe. When this is required, you can use the
toPandas function to pull a Spark dataframe into memory on the
driver node. The PySpark snippet below shows how to perform this
task and then convert the Pandas dataframe back to a Spark
dataframe. In general, it’s best to avoid Pandas when authoring
PySpark workflows, because it prevents distribution and scale, but
it’s often the best way of expressing a command to execute.


stats_pd = stats_df.toPandas()

stats_df = sqlContext.createDataFrame(stats_pd)
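Because toPandas pulls every row onto the driver, it can help to check the size of a dataframe before converting it. The guard below is a minimal sketch of that idea. The function name and the default limit of 100,000 rows are assumptions for illustration, not values from the text, and the count call is itself an action that runs a Spark job.

```python
def to_pandas_checked(df, max_rows=100_000):
    """Convert a Spark dataframe to Pandas only if it is small enough."""
    row_count = df.count()  # an action: triggers a Spark job
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows exceeds max_rows={max_rows}; "
            "aggregate or filter before calling toPandas")
    return df.toPandas()


# Hypothetical usage with the stats dataframe:
# stats_pd = to_pandas_checked(stats_df, max_rows=1_000_000)
```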


To bridge the gap between Pandas and Spark dataframes,
Databricks introduced a new library called Koalas that resembles
the Pandas API for Spark-backed dataframes. The result is that
you can author Python code that works with Pandas commands
that can scale to Spark-scale data sets. An example of converting
a Spark dataframe to Koalas and back to Spark is shown in the
following snippet. After converting the stats dataframe to Koalas,
the snippet shows how to calculate the average time on ice as well
as index into the Koalas frame. The intent of Koalas is to provide
a Pandas interface to Spark dataframes, and as the Koalas library
matures more Python modules may be able to take advantage of
Spark. The output from the snippet shows that the average time
on ice was 993 seconds per game for NHL players.


import databricks.koalas as ks

stats_ks = stats_df.to_koalas()
stats_df = stats_ks.to_spark()

print(stats_ks['timeOnIce'].mean())
print(stats_ks.iloc[:1, 1:2])

During the development of this book, Koalas is still preliminary
and only partially implemented, but it strives to provide a familiar
interface for Python coders. Both Pandas and Spark dataframes
can work with Koalas, and the snippet below shows how to go
from Spark to Koalas to Pandas to Spark, and Spark to Pandas
to Koalas to Spark.

# spark -> koalas -> pandas -> spark
df = sqlContext.createDataFrame(stats_df.to_koalas().toPandas())

# spark -> pandas -> koalas -> spark
df = ks.from_pandas(stats_df.toPandas()).to_spark()

In general, you’ll be working with Spark dataframes when authoring
code in a PySpark environment. However, it’s useful to be able
to work with different object types as necessary to build model
workflows. Koalas and Pandas UDFs provide powerful tools for
porting workloads to large-scale data ecosystems.

6.3.3 Transforming Data 

The PySpark Dataframe API provides a variety of useful functions
for aggregating, filtering, pivoting, and summarizing data. While
some of this functionality maps well to Pandas operations, my
recommendation for quickly getting up and running with munging
data in PySpark is to use the SQL interface to dataframes in Spark
called Spark SQL. If you’re already using the pandasql or framequery
libraries, then Spark SQL should provide a familiar interface. If
you’re new to these libraries, then the SQL interface still provides
an approachable way of working with the Spark ecosystem. We’ll
cover the Dataframe API later in this section, but first start with
the SQL interface to get up and running.

Exploratory data analysis (EDA) is one of the key steps in a data
science workflow for understanding the shape of a data set. To
work through this process in PySpark, we’ll load the stats data set
into a dataframe, expose it as a view, and then calculate summary
statistics. The snippet below shows how to load the NHL stats data
set, expose it as a view to Spark, and then run a query against the
dataframe. The aggregated dataframe is then visualized using the
display command in Databricks.

stats_df = spark.read.csv("s3://dsp-ch6/csv/game_skater_stats.csv",
    header=True, inferSchema=True)
stats_df.createOrReplaceTempView("stats")

new_df = spark.sql("""
  select player_id, sum(1) as games, sum(goals) as goals
  from stats
  group by 1
  order by 3 desc
  limit 5
""")
display(new_df)

player_id    games    goals
8471214      788      434
8474564      655      342
8474141      748      311
8475166      700      308
8470794      782      305

FIGURE 6.4: Summarizing player activity.


The output of this code block is shown in Figure 6.4. It shows
the highest scoring players in the NHL dataset by ranking the
results based on the total number of goals. One of the powerful
features of Spark is that the SQL query will not operate against the
dataframe until a result set is needed. This means that commands
in a notebook can set up multiple data transformation steps for
Spark dataframes, which are not performed until a later step needs
to execute the graph of operations defined in a code block.
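The deferred execution described above can be illustrated with a toy class in plain Python. This is only a conceptual sketch: LazyFrame is a hypothetical name, and the code is not how Spark is implemented. Each transformation records a step, and nothing runs until collect is called.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not Spark's API)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step; nothing is computed yet.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # Record the step; nothing is computed yet.
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def collect(self):
        # The "action": run every recorded step in order.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


calls = []

def has_goal(row):
    calls.append(row["player_id"])  # track when the predicate really runs
    return row["goals"] > 0

rows = [{"player_id": 1, "goals": 2, "shots": 5},
        {"player_id": 2, "goals": 0, "shots": 3}]
plan = LazyFrame(rows).filter(has_goal).select("player_id", "goals")
calls_before_collect = list(calls)  # still empty: only a plan exists
result = plan.collect()             # the predicate runs here
```

Spark goes further than this toy version by optimizing the recorded plan before running it, which is one reason deferring work is useful.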

Spark SQL is expressive, fast, and my go-to method for working
with big data sets in a Spark environment. While prior Spark
versions performed better with the Dataframe API versus Spark SQL,
the difference in performance is now trivial and you should use the
transformation tools that provide the best iteration speed for
working with large data sets. With Spark SQL, you can join dataframes,
run nested queries, set up temp tables, and mix expressive Spark
operations with SQL operations. For example, if you want to look
at the distribution of goals versus shots in the NHL stats data, you
can run the following command on the dataframe.


display(spark.sql("""
  select cast(goals/shots * 50 as int)/50.0 as Goals_per_shot
      ,sum(1) as Players
  from (
    select player_id, sum(shots) as shots, sum(goals) as goals
    from stats
    group by 1
    having goals >= 5
  )
  group by 1
  order by 1
"""))

This query restricts the ratio of goals to shots to players with at
least 5 goals, to prevent outliers such as goalies scoring during
power plays. We’ll use the display command to output the result
set as a table and then use Databricks to display the output as a
chart. Many Spark ecosystems have ways of visualizing results, and
the Databricks environment provides this capability through the
display command, which works well with both tabular and pivot
table data. After running the above command, you can click on
the chart icon and choose dimensions and measures which show
the distribution of goals versus shots, as visualized in Figure 6.5.
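The expression cast(goals/shots * 50 as int)/50.0 in the query groups each player’s ratio into buckets that are 0.02 wide. The arithmetic can be checked in plain Python, as sketched below. The function name is hypothetical, and the sketch assumes that truncation with int matches the SQL cast for non-negative ratios.

```python
def goals_per_shot_bucket(goals, shots):
    """Round a goals-to-shots ratio down to the nearest multiple of 0.02."""
    return int(goals / shots * 50) / 50.0


# A player with 9 goals on 100 shots has a ratio of 0.09,
# which falls into the 0.08 bucket.
low_bucket = goals_per_shot_bucket(9, 100)

# A player scoring on one shot in three has a ratio of about 0.333,
# which falls into the 0.32 bucket.
high_bucket = goals_per_shot_bucket(1, 3)
```

Multiplying by 50 before truncating and dividing by 50.0 afterwards is what sets the bucket width: 1/50 = 0.02.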

[Figure: chart of the number of players by Goals_per_shot, with
the x-axis ranging from 0 to 0.32.]

FIGURE 6.5: Distribution of goals per shot.

While I’m an advocate of using SQL to transform data, since it
scales to different programming environments, it’s useful to get
familiar with some of the basic dataframe operations in PySpark.
The code snippet below shows how to perform common operations
including dropping columns, selecting a subset of columns, and
adding new columns to a Spark dataframe. Like prior commands,
all of these operations are lazily performed. There are some
syntax differences from Pandas, but the general commands used for
transforming data sets should be familiar.


from pyspark.sql.functions import lit

# dropping columns
copy_df = stats_df.drop('game_id', 'player_id')

# selecting columns
copy_df = copy_df.select('assists', 'goals', 'shots')

# adding columns
copy_df = copy_df.withColumn("league", lit('NHL'))
display(copy_df)


One of the common operations to perform between dataframes is a
join. This is easy to express in a Spark SQL query, but sometimes
it is preferable to do this programmatically with the Dataframe
API. The code snippet below shows how to join two dataframes
together, when joining on the game_id and player_id fields. The
league column, which is a literal, will be joined with the rest of the
stats dataframe. This is a trivial example where we are adding a
new column onto a small dataframe, but the join operation from
the Dataframe API can scale to massive data sets.

copy_df = stats_df.select('game_id', 'player_id').withColumn(
    "league", lit('NHL'))
df = copy_df.join(stats_df, ['game_id', 'player_id'])
display(df)

The result set from the join operation above is shown in Figure
6.6. Spark supports a variety of different join types, and in this
example we used an inner join to append the league column to the
players stats dataframe.

game_id     player_id  league  team_id  timeOnIce  assists  goals  shots  hits  powerPlayGoals  powerPlayAssists
2011030221  8467412    NHL     1        999        0        0      1      3     0               0
2011030221  8468501    NHL     1        1168       0        0      0      4     0               0
2011030221  8470609    NHL     1        558        0        0      2      1     0               0
2011030221  8471816    NHL     1        1134       0        0      1      4     0               0
2011030221  8472410    NHL     1        436        0        0      1      3     0               0
2011030221  8471233    NHL     1        1344       0        1      4      0     1               0

FIGURE 6.6: The dataframe resulting from the join.

It’s also possible to perform aggregation operations on a dataframe,
such as calculating sums and averages of columns. An example of
computing the average time on ice for players in the stats data set,
and the total number of goals scored, is shown in the snippet below.
The groupBy command uses the player_id as the column for
collapsing the data set, and the agg command specifies the
aggregations to perform.


summary_df = stats_df.groupBy("player_id").agg(
    {'timeOnIce': 'avg', 'goals': 'sum'})
display(summary_df)


The snippet creates a dataframe with the player_id column and the
aggregated timeOnIce and goals columns. We’ll again use the
plotting functionality in Databricks to visualize the results, but this
time select the scatter plot option. The resulting plot of goals
versus time on ice is shown in Figure 6.7.

FIGURE 6.7: Time on ice and goal scoring plots.
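When agg is given a dictionary, Spark names the output columns after the aggregation call, such as avg(timeOnIce) and sum(goals). A minimal sketch of renaming them to plainer names follows. The helper name and the new column names are assumptions for illustration.

```python
def summarize_players(stats_df):
    """Average time on ice and total goals per player, with plain names."""
    return (stats_df.groupBy("player_id")
            .agg({"timeOnIce": "avg", "goals": "sum"})
            # The dictionary form names its outputs after the function call.
            .withColumnRenamed("avg(timeOnIce)", "avg_time_on_ice")
            .withColumnRenamed("sum(goals)", "total_goals"))


# Hypothetical usage in a notebook:
# display(summarize_players(stats_df))
```

Plain column names are easier to reference in later SQL queries, where parentheses in a name would need quoting.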

We’ve worked through introductory examples to get up and running
with dataframes in PySpark, focusing on operations that are
useful for munging data prior to training machine learning models.
These types of operations, in combination with reading and writing
dataframes, provide a useful set of skills for performing exploratory
analysis on massive data sets.

6.3.4 Pandas UDFs 

While PySpark provides a great deal of functionality for working
with dataframes, it often lacks core functionality provided
in Python libraries, such as the curve-fitting functions in SciPy.
While it’s possible to use the toPandas function to convert
dataframes into the Pandas format for Python libraries, this
approach breaks when using large data sets. Pandas UDFs are a
newer feature in PySpark that help data scientists work around
this problem, by distributing the Pandas conversion across the
worker nodes in a Spark cluster. With a Pandas UDF, you define
a group by operation to partition the data set into dataframes
that are small enough to fit into the memory of worker nodes, and
then author a function that takes in a Pandas dataframe as the
input parameter and returns a transformed Pandas dataframe as
the result. Behind the scenes, PySpark uses the PyArrow library
to efficiently translate dataframes from Spark to Pandas and back
from Pandas to Spark. This approach enables Python libraries,
such as Keras, to be scaled up to a large cluster of machines.

This section will walk through an example problem where we need
to use an existing Python library and show how to translate the
workflow into a scalable solution using Pandas UDFs. The question
we are looking to answer is whether there is a positive or
negative relationship between the shots and hits attributes in the
stats data set. To calculate this relationship, we can use the leastsq
function in SciPy, as shown in the snippet below. This example
creates a Pandas dataframe for a single player_id, and then fits
a simple linear regression between these attributes. The output
is the set of coefficients found by the least-squares fit, and in
this case the number of shots was not strongly correlated with the
number of hits.


sample_pd = spark.sql("""
  select * from stats
  where player_id = 8471214
""").toPandas()

# Import Python libraries
from scipy.optimize import leastsq
import numpy as np

# Define a function to fit
def fit(params, x, y):
    return (y - (params[0] + x * params[1]))

# Fit the curve and show the results
result = leastsq(fit, [1, 0], args=(sample_pd.shots, sample_pd.hits))
print(result)
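For a straight line, the two coefficients that the least-squares fit searches for can also be written in closed form, which helps when interpreting the output: the first value is the intercept and the second is the slope. The sketch below is a plain-Python illustration with made-up numbers. It is not the leastsq call used in this section, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = p0 + p1 * x, in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx               # slope; undefined when all x are equal
    p0 = mean_y - p1 * mean_x    # intercept
    return p0, p1


# Made-up data that lies exactly on the line y = 2 + 0.5 * x.
p0, p1 = fit_line([0, 1, 2, 3], [2.0, 2.5, 3.0, 3.5])
```

A slope near zero, as in the single-player example above, indicates that one attribute tells us little about the other.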

Now we want to perform this operation for every player in the stats
data set. To scale to this volume, we’ll first partition by player_id,
as shown by the groupBy operation in the code snippet below. Next,
we’ll run the analyze_player function for each of these partitioned
data sets using the apply command. While the stats_df dataframe
used as input to this operation and the player_df dataframe
returned are Spark dataframes, the sample_pd dataframe and the
dataframe returned by the analyze_player function are Pandas. The
Pandas UDF annotation provides a hint to PySpark for how to
distribute this workload so that it can scale the operation across the
cluster of worker nodes rather than eagerly pulling all of the data
to the driver node. Like most Spark operations, Pandas UDFs are
lazily evaluated and will not be executed until an output value is
needed.

Our initial example, now translated to use Pandas UDFs, is shown
below. After defining additional modules to include, we specify the
schema of the dataframe that will be returned from the operation.
The schema object defines the structure of the Spark dataframe
that will be returned from applying the analyze_player function.
The next step in the code block lists an annotation that defines
this function as a grouped map operation, which means that it
works on dataframes rather than scalar values. As before, we’ll
use the leastsq function to fit the shots and hits attributes. After
calculating the coefficients for this curve fitting, we create a new
Pandas dataframe with the player ID and regression coefficients.
The display command at the end of this code block will force the
Pandas UDF to execute, which will create a partition for each of
the players in the data set, apply the least squares operation, and
merge the results back together into a Spark dataframe.

# Load necessary libraries
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

# Create the schema for the resulting dataframe
schema = StructType([StructField('ID', LongType(), True),
                     StructField('p0', DoubleType(), True),
                     StructField('p1', DoubleType(), True)])

# Define the UDF, input and outputs are Pandas DFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def analyze_player(sample_pd):

    # return empty params if not enough data
    if (len(sample_pd.shots) <= 1):
        return pd.DataFrame({'ID': [sample_pd.player_id[0]],
                             'p0': [0], 'p1': [0]})

    # Perform curve fitting
    result = leastsq(fit, [1, 0], args=(sample_pd.shots,
                                        sample_pd.hits))

    # Return the parameters as a Pandas DF
    return pd.DataFrame({'ID': [sample_pd.player_id[0]],
                         'p0': [result[0][0]], 'p1': [result[0][1]]})

# perform the UDF and show the results
player_df = stats_df.groupby('player_id').apply(analyze_player)
display(player_df)
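On Spark 3.0 and later, the same grouped map pattern can also be written with the applyInPandas method, which takes a plain function and the output schema instead of a decorator. The sketch below assumes that Spark version and assumes analyze_player is defined without the pandas_udf annotation. The wrapper name fit_all_players is hypothetical.

```python
def fit_all_players(stats_df, analyze_player):
    """Apply a Pandas-in, Pandas-out function to each player's rows.

    `analyze_player` is assumed to be the plain, undecorated function.
    """
    return stats_df.groupby("player_id").applyInPandas(
        analyze_player, schema="ID long, p0 double, p1 double")


# Hypothetical usage in a notebook:
# player_df = fit_all_players(stats_df, analyze_player)
# display(player_df)
```

The schema string describes the same three columns as the StructType in the snippet above.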


The key capability that Pandas UDFs provide is that they enable
Python libraries to be used in a distributed environment, as long
as you have a good way of partitioning your data. This means that
libraries such as Featuretools, which were not initially designed to
work in a distributed environment, can be scaled to a large cluster.
The result of applying the Pandas UDF from above on the stats
data set is shown in Figure 6.8. This feature enables a mostly
seamless translation between different dataframe formats.

ID        p0                        p1
8470085   2.344963791971333         -0.15734035549738007
8471859   0.552176162823148         0.02217367140041992
8475765   0.7783287624631094        -0.00016742139167102827
8476426   -2.1813661987835076e-12   1.6666666666703023
8476439   2.2017087251697496        0.08646011684051094
8476445   0.16666666666484875       0.1666666666670303

FIGURE 6.8: The output dataframe from the Pandas UDF.

To further demonstrate the value of Pandas UDFs, we’ll apply
them to distributing a feature generation pipeline and a deep
learning pipeline. However, there are some issues when using Pandas
UDFs in workflows, because they can make debugging more of a
challenge and sometimes fail due to data type mismatches between
Spark and Pandas.
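One way to ease debugging is to exercise the per-group logic locally on a few in-memory rows before handing it to Spark. The sketch below imitates the split, apply, and merge steps of a grouped map in plain Python. The function names and sample rows are made up for illustration, and rows are plain dictionaries rather than Pandas dataframes.

```python
from collections import defaultdict


def grouped_map(rows, key, func):
    """Split rows by key, apply func to each group, and merge the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    merged = []
    for group_rows in groups.values():
        merged.extend(func(group_rows))  # func returns a list of rows
    return merged


def count_shots_and_hits(group_rows):
    """Example per-group function: one summary row per player."""
    return [{
        "ID": group_rows[0]["player_id"],
        "shots": sum(r["shots"] for r in group_rows),
        "hits": sum(r["hits"] for r in group_rows),
    }]


rows = [
    {"player_id": 1, "shots": 2, "hits": 1},
    {"player_id": 2, "shots": 0, "hits": 3},
    {"player_id": 1, "shots": 4, "hits": 0},
]
summary = grouped_map(rows, "player_id", count_shots_and_hits)
```

Checking the output column names and types of the per-group function in this way can catch schema mismatches before a cluster job is launched.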

6.3.5 Best Practices 

While PySpark provides a familiar environment for Python
programmers, it’s good to follow a few best practices to make sure
you are using Spark efficiently. Here are a set of recommendations
I’ve compiled based on my experience porting a few projects from
Python to PySpark:

• Avoid dictionaries: Using Python data types such as dictionaries
means that the code might not be executable in a distributed
mode. Instead of using keys to index values in a dictionary,
consider adding another column to a dataframe that can be used
as a filter. This recommendation applies to other Python types
including lists that are not distributable in PySpark.

• Limit Pandas usage: Calling toPandas will cause all data to be
loaded into memory on the driver node, and prevents operations
from being performed in a distributed mode. It’s fine to use this
function when data has already been aggregated and you want
to make use of familiar Python plotting tools, but it should not
be used for large dataframes.

• Avoid loops: Instead of using for loops, it’s often possible to
use functional approaches such as group by and apply to achieve
the same result. Using this pattern means that code can be
parallelized by supported execution environments. I’ve noticed that
focusing on using this pattern in Python has also resulted in
cleaner code that is easier to translate to PySpark.

• Minimize eager operations: In order for your pipeline to be as
scalable as possible, it’s good to avoid eager operations that pull
full dataframes into memory. For example, reading in CSVs is an
eager operation, and my workaround is to stage the dataframe
to S3 as Parquet before using it in later pipeline steps.

• Use SQL: There are libraries that provide SQL operations
against dataframes in both Python and PySpark. If you’re
working with someone else’s Python code, it can be tricky to decipher
what some of the Pandas operations are achieving. If you plan
on porting your code from Python to PySpark, then using a SQL
library for Pandas can make this translation easier.

By following these best practices when writing PySpark code, I’ve
been able to improve both my Python and PySpark data science
workflows.
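The workaround mentioned under minimizing eager operations, staging a CSV as Parquet before later steps use it, can be sketched as a small helper. The function name and the paths in the usage comment are hypothetical, and the Spark session is passed in as a parameter.

```python
def stage_csv_as_parquet(spark, csv_path, parquet_path):
    """Read a CSV once, persist it as Parquet, and return the Parquet view."""
    csv_df = spark.read.csv(csv_path, header=True, inferSchema=True)
    csv_df.write.mode("overwrite").parquet(parquet_path)
    # Later steps read the columnar copy instead of re-parsing the CSV.
    return spark.read.parquet(parquet_path)


# Hypothetical usage in a notebook where `spark` is already defined:
# stats_df = stage_csv_as_parquet(
#     spark, "s3://dsp-ch6/csv/game_skater_stats.csv",
#     "s3a://dsp-ch6/staged/game_skater_stats/")
```

Because Parquet stores the schema with the data, later reads also skip type inference.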


6.4 MLlib Batch Pipeline 

Now that we’ve covered loading and transforming data with PySpark,
we can use the machine learning libraries in PySpark to
build a predictive model. The core library for building predictive
models in PySpark is called MLlib. This library provides a suite of
supervised and unsupervised algorithms. While this library does
not have complete coverage of all of the algorithms in sklearn, it
provides functionality for the majority of the types of operations
needed for data science workflows. In this chapter, we’ll show how
to apply MLlib to a classification problem and save the outputs
from the model application to a data lake.


games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
    header=True, inferSchema=True)
games_df.createOrReplaceTempView("games_df")

games_df = spark.sql("""
  select *, row_number() over (order by rand()) as user_id
      ,case when rand() > 0.7 then 1 else 0 end as test
  from games_df
""")

The first step in the pipeline is loading the data set that we want
to use for model training. The snippet above shows how to load the
games data set, and append two additional attributes to the loaded
dataframe using Spark SQL. The result of running this query is
that about 30% of users will be assigned a test value of 1, which
we’ll use for model application, and each record is assigned a unique
user ID which we’ll use when saving the model predictions.
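Because rand() is called without a seed, the split differs on every run. One hedged variation is to pass seeds to rand, as in the query builder sketched below. The function name and default seed are assumptions, two different seeds are used so that the user ID ordering and the test flag are not driven by the same random draws, and repeatable results also depend on Spark reading the data with the same partitioning on each run.

```python
def build_split_query(view, test_fraction=0.3, seed=42):
    """Build the user_id / test assignment query with seeded random draws."""
    threshold = round(1.0 - test_fraction, 6)
    return f"""
      select *, row_number() over (order by rand({seed})) as user_id
          ,case when rand({seed + 1}) > {threshold} then 1 else 0 end as test
      from {view}
    """


query = build_split_query("games_df")
# Hypothetical usage in a notebook:
# games_df = spark.sql(query)
```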

The next step is splitting up the data set into train and test
dataframes. For this pipeline, we’ll use the test dataframe as the
data set for model application, where we predict user behavior.
An example of splitting up the dataframes using the test column
is shown in the snippet below. This should result in roughly 16.1k
training users and 6.8k test users.

trainDF = games_df.filter("test == 0")
testDF = games_df.filter("test == 1")
print("Train " + str(trainDF.count()))
print("Test " + str(testDF.count()))

6.4.1 Vector Columns

MLlib requires that the input data is formatted using vector data
types in Spark. To transform our dataframe into this format, we
can use the VectorAssembler class to combine a range of columns into
a single vector column. The code snippet below shows how to use
this class to merge the first 10 columns in the dataframe into a new
vector column called features. After applying this command to the
training dataframe using the transform function, we use the select
function to retrieve only the values we need from the dataframe
for model training and application. For the training dataframe, we
only need the label and features, and with the test dataframe we
also select the user ID.


from pyspark.ml.feature import VectorAssembler

# create a vector representation
assembler = VectorAssembler(
    inputCols=trainDF.schema.names[0:10],
    outputCol="features")

trainVec = assembler.transform(trainDF).select('label', 'features')
testVec = assembler.transform(testDF).select(
    'label', 'features', 'user_id')
display(testVec)


The display command shows the result of transforming our test
dataset into vector types usable by MLlib. The output dataframe
is visualized in Figure 6.9.
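A sparse vector stores only the non-zero entries of a row, as a size, a list of positions, and a list of values. The plain-Python sketch below illustrates that idea for a row with ten features. It is an illustration of the format rather than MLlib’s own classes, and the function name is hypothetical.

```python
def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries of a list."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


# A row where only the eighth feature (position 7) is set.
sparse_row = to_sparse([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
```

For data sets such as this one, where most features are zero, the sparse form takes less space than listing every value.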

6.4.2 Model Application 

Now that we have prepared our training and test data sets, we can 
use the logistic regression algorithm provided by MLlib to fit the 
training dataframe. We first create a logistic regression object and 
define the columns to use as labels and features. Next, we use the
fit function to train the model on the training data set. In the last
step in the snippet below, we use the transform function to apply
the model to our test data set.


FIGURE 6.9: The features in the sparse vector format.


from pyspark.ml.classification import LogisticRegression

# specify the columns for the model
lr = LogisticRegression(featuresCol='features', labelCol='label')

# fit on training data
model = lr.fit(trainVec)

# predict on test data
predDF = model.transform(testVec)


The resulting dataframe now has a probability column, as shown in
Figure 6.10. This column is a 2-element array with the probabilities
for class 0 and 1. To test the accuracy of the logistic regression
model on the test data set, we can use the binary classification
evaluator in MLlib to calculate the area under the ROC curve, as
shown in the snippet below. For my run of the model, the ROC metric
has a value of 0.761.


from pyspark.ml.evaluation import BinaryClassificationEvaluator

roc = BinaryClassificationEvaluator().evaluate(predDF)
print(roc)


FIGURE 6.10: The dataframe with propensity scores.
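
One way to read the ROC metric is as the probability that a randomly chosen positive user receives a higher score than a randomly chosen negative user. The sketch below computes that quantity directly in plain Python on toy data. The function name auc_by_pairs and the sample values are made up for illustration, and this is not how MLlib computes the metric internally.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy data: three of the four positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which gives some context for the 0.761 reported above.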


In a production pipeline, there will not be labels for the users
that need predictions, meaning that you’ll need to perform cross
validation to select the best model for making predictions. An
example of this approach is covered in Section 6.7. In this case,
we are using a single data set to keep code examples short, but a
similar pipeline can be used in production workflows.

Now that we have the model predictions for our test users, we
need to retrieve the predicted label in order to create a dataframe
to persist to a data lake. Since the probability column created by
MLlib is an array, we’ll need to define a UDF that retrieves the
second element as our propensity column, as shown in the PySpark
snippet below.


from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# split out the array into a column
secondElement = udf(lambda v: float(v[1]), FloatType())
predDF = predDF.select("*",
    secondElement("probability").alias("propensity"))

display(predDF)


After running this code block, the dataframe will have an additional
column called propensity, as shown in Figure 6.10. The final
step in this batch prediction pipeline is to save the results to S3.
We’ll use the select function to retrieve the relevant columns from
the predictions dataframe, and then use the write function on the
dataframe to persist the results as Parquet on S3.

# save results to S3
results_df = predDF.select("user_id", "propensity")
results_path = "s3a://dsp-ch6/game-predictions/"
results_df.write.mode('overwrite').parquet(results_path)


We now have all of the building blocks needed to create a PySpark 
pipeline that can fetch data from a storage layer, train a predictive
model, and write the results to persistent storage. We’ll cover how 
to schedule this type of pipeline in Section 6.8. 

When developing models, it’s useful to inspect the output to see 
if the distribution of model predictions matches expectations. We 
can use Spark SQL to perform an aggregation on the model outputs
and then use the display command to perform this process
directly in Databricks, as shown in the snippet below. The result
of performing these steps on the model predictions is shown in
Figure 6.11.

# plot the predictions
predDF.createOrReplaceTempView("predDF")

plotDF = spark.sql("""
  select cast(propensity*100 as int)/100 as propensity,
         label, sum(1) as users
  from predDF
  group by 1, 2
  order by 1, 2
""")

# table output
display(plotDF)
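
To make the aggregation in the query above concrete, here is a small plain-Python sketch of the same bucketing logic: each propensity is truncated to two decimal places, and users are counted per bucket and label. The function name propensity_histogram and the sample rows are made up for illustration.

```python
from collections import Counter


def propensity_histogram(rows):
    """rows: iterable of (propensity, label). Returns {(bucket, label): users}."""
    counts = Counter()
    for propensity, label in rows:
        bucket = int(propensity * 100) / 100  # truncate to two decimals
        counts[(bucket, label)] += 1
    return dict(sorted(counts.items()))


# made-up scores: three users fall into the 0.12 bucket, one into 0.5
rows = [(0.123, 0), (0.129, 0), (0.125, 1), (0.5, 1)]
print(propensity_histogram(rows))
```

If the model is useful, we would expect users with label 1 to be concentrated in the higher propensity buckets.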


FIGURE 6.11: The distribution of propensity scores.


MLlib can be applied to a wide variety of problems using a large
suite of algorithms. While we explored logistic regression in this
section, the library provides a number of different classification
approaches, and there are other types of operations supported,
including regression and clustering.


6.5 Distributed Deep Learning 

While MLlib provides scalable implementations for classic machine
learning algorithms, it does not natively support deep learning 
libraries such as TensorFlow and PyTorch. There are libraries that 
parallelize the training of deep learning models on Spark, but the 
data set needs to be able to fit in memory on each worker node, 
and these approaches are best used for distributed hyperparameter 
tuning on medium-sized data sets. 

For the model application stage, where we already have a deep 
learning model trained but need to apply the resulting model to 
a large user base, we can use Pandas UDFs. With Pandas UDFs, 
we can partition and distribute our data set, run the resulting 
dataframes against a Keras model, and then compile the results 
back into a single large Spark dataframe. This section will show
how we can take the Keras model that we built in Section 1.6.3,
and scale it to larger data sets using PySpark and Pandas UDFs.
However, we still have the requirement that the data used for
training the model can fit into memory on the driver node.

We’ll use the same data sets from the prior section, where we split
the games data set into training and test sets of users. This is a
relatively small data set, so we can use the toPandas operation to
load the dataframe onto the driver node, as shown in the snippet
below. The result is a dataframe and list that we can provide as
input to train a Keras deep learning model.

# build model on the driver node
train_pd = trainDF.toPandas()
x_train = train_pd.iloc[:,0:10]
y_train = train_pd['label']


When using PyPI to install TensorFlow on the Spark cluster, the 
installed version of the library should be 2.0 or greater. This differs 
from Version 1 of TensorFlow, which we used in prior chapters. 
The main impact in terms of the code snippets is that TensorFlow 
2 now has a built-in AUC function that no longer requires the 
workflow we previously applied. 

6.5.1 Model Training 

We’ll use the same approach as before to train a Keras model. The
code snippet below shows how to set up a network with an input
layer, dropout layer, single hidden layer, and an output layer,
optimized with rmsprop and a cross entropy loss. In the model
application phase, we’ll reuse the model object in a Pandas UDF to
distribute the workload.

import tensorflow as tf
import keras
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy')
history = model.fit(x_train, y_train, epochs=100, batch_size=100,
                    validation_split=.2, verbose=0)


To test for overfitting, we can plot the results of the training and
validation data sets, as shown in Figure 6.12. The snippet below
shows how to use matplotlib to display the losses over time for
these data sets. While the training loss continued to decrease over
additional epochs, the validation loss stopped improving after 20
epochs, but did not noticeably increase over time.


import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

fig = plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.legend()
plt.show()
display(fig)
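
In addition to reading the plot, the point where the validation loss stops improving can be located programmatically. The helper below is a minimal sketch in plain Python. The name best_epoch and the min_delta threshold are assumptions for illustration and are not part of Keras. It could be applied to the val_loss list retrieved from the history object.

```python
def best_epoch(val_loss, min_delta=0.0):
    """Return the 1-based epoch with the lowest validation loss.

    A later epoch only replaces the current best if it improves the
    loss by more than min_delta.
    """
    best, best_loss = 1, val_loss[0]
    for epoch, loss in enumerate(val_loss, start=1):
        if loss < best_loss - min_delta:
            best, best_loss = epoch, loss
    return best


# made-up losses: improvement stalls after the third epoch
sample_val_loss = [0.70, 0.55, 0.50, 0.50, 0.51, 0.50]
print(best_epoch(sample_val_loss))  # 3
```

If the best epoch is far earlier than the number of epochs trained, that is a hint that training could be stopped sooner without hurting validation performance.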


6.5.2 Model Application 

Now that we have a trained deep learning model, we can use PySpark
to apply it in a scalable pipeline. The first step is determining
how to partition the set of users that need to be scored. For this
data set, we can split the user base into 100 different buckets, as
shown in the snippet below. This randomly assigns each user into 1
of 100 buckets, which means that after applying the group by step,
each dataframe that gets translated to Pandas will be roughly 1%
of the size of the original dataframe. If you have a large data set,
you may need to use thousands of buckets to distribute the data
set, and maybe more.


FIGURE 6.12: Training a Keras model on a subset of data.


# set up partitioning for the test dataframe
testDF.createOrReplaceTempView("testDF")

partitionedDF = spark.sql("""
  select *, cast(rand()*100 as int) as partition_id
  from testDF
""")

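
The effect of the random bucketing step can be checked with a quick simulation. The plain-Python sketch below mirrors the cast(rand()*100 as int) expression on a made-up list of 10,000 user IDs and reports how evenly the buckets are filled. The function name assign_buckets and the seed are assumptions for illustration.

```python
import random
from collections import Counter


def assign_buckets(user_ids, num_buckets=100, seed=0):
    """Map each user ID to a random bucket in [0, num_buckets)."""
    rng = random.Random(seed)
    return {uid: int(rng.random() * num_buckets) for uid in user_ids}


buckets = assign_buckets(range(10000))
sizes = Counter(buckets.values())

# number of buckets used, smallest bucket size, largest bucket size
print(len(sizes), min(sizes.values()), max(sizes.values()))
```

Each bucket ends up with roughly 1% of the users, though the sizes vary somewhat because the assignment is random.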

The next step is to define the Pandas UDF that will apply the
Keras model. We’ll define an output schema of a user ID and
propensity score, as shown below. The UDF uses the predict function
on the model object we previously trained to create a prediction
column on the passed in dataframe. The return command
selects the two relevant columns that we defined for the schema
object. The group by command partitions the data set using our
bucketing approach, and the apply command performs the Keras
model application across the cluster of worker nodes. The result is
a Spark dataframe visualized with the display command, as shown
in Figure 6.13.


FIGURE 6.13: The resulting dataframe for distributed Keras.


from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([StructField('user_id', LongType(), True),
                     StructField('propensity', DoubleType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_keras(pd):
    pd['propensity'] = model.predict(pd.iloc[:,0:10])
    return pd[['user_id', 'propensity']]

results_df = partitionedDF.groupby('partition_id').apply(apply_keras)
display(results_df)
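
The grouped map pattern used here follows a split, apply, and combine flow. The plain-Python sketch below imitates that flow on a list of dictionaries, without Spark or Pandas, to show what happens to the rows. The names grouped_map and score_group are hypothetical, and the scoring rule is a made-up stand-in for the model's predict function.

```python
from collections import defaultdict


def grouped_map(rows, key, func):
    """Split rows by key, apply func to each group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    output = []
    for _, group in sorted(groups.items()):
        output.extend(func(group))
    return output


def score_group(group):
    # stand-in for model.predict: a made-up score derived from one feature
    return [{"user_id": r["user_id"], "propensity": r["x"] / 10.0} for r in group]


rows = [
    {"user_id": 1, "x": 2, "partition_id": 0},
    {"user_id": 2, "x": 5, "partition_id": 1},
    {"user_id": 3, "x": 8, "partition_id": 0},
]
print(grouped_map(rows, "partition_id", score_group))
```

In Spark, each group is processed on a worker node rather than in a loop, but the shape of the input and output is the same: every group returns only the columns declared in the schema.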


One thing to note is that there are limitations on the types of
objects that you can reference in a Pandas UDF. In this example,
we referenced the model object, which was created on the driver
node when training the model. When variables in PySpark are
transferred from the driver node to worker nodes for distributed
operations, a copy of the variable is made, because synchronizing
variables across a cluster would be inefficient. This means that any
changes made to a variable within a Pandas UDF will not apply
to the original object. It’s also why data types such as Python lists
and dictionaries should be avoided when using UDFs. Functions
work in a similar way, and in Section 6.3.4 we used the fit function
in a Pandas UDF where the function was initially defined on the
driver node. Spark also provides broadcast variables for sharing
variables in a cluster, but ideally distributed code segments should
avoid sharing state through variables if possible.
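
The copy semantics described above can be imitated locally. The sketch below uses the standard pickle module to serialize a list before handing it to a function, which is a simplified stand-in for how a variable reaches a worker node. Spark's actual serialization of closures is more involved, and the helper names run_on_worker and append_score are hypothetical.

```python
import pickle


def run_on_worker(func, variable):
    """Mimic shipping a variable to a worker: the function sees a deserialized copy."""
    worker_copy = pickle.loads(pickle.dumps(variable))
    return func(worker_copy)


def append_score(scores):
    scores.append(0.9)  # mutates only the worker's copy
    return len(scores)


driver_scores = [0.1, 0.2]
print(run_on_worker(append_score, driver_scores))  # length seen on the "worker"
print(driver_scores)  # the driver's list is unchanged
```

The function observes three elements, while the list on the driver still has two, which is why results should be returned from the UDF rather than written into shared variables.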


6.6 Distributed Feature Engineering 

Feature engineering is a key step in a data science workflow, and
sometimes it is necessary to use Python libraries to implement
this functionality. For example, the AutoModel system at Zynga
uses the Featuretools library to generate hundreds of features from
raw tracking events, which are then used as input to classification
models. To scale up the automated feature engineering approach
that we first explored in Section 1.7, we can use Pandas UDFs to
distribute the feature application process. Like the prior section,
we need to sample data when determining the transformation to
perform, but when applying the transformation we can scale to
massive data sets.

For this section, we’ll use the game plays data set from the NHL
Kaggle example, which includes detailed play-by-play descriptions
of the events that occurred during each match. Our goal is to
transform the deep and narrow dataframe into a shallow and wide
dataframe that summarizes each game as a single record with
hundreds of columns. An example of loading this data in PySpark and
selecting the relevant columns is shown in the snippet below. Before
calling toPandas, we use the filter function to sample 0.3% of
the records, and then cast the result to a Pandas frame, which has
a shape of 10,717 rows and 16 columns.


plays_df = spark.read.csv("s3://dsp-ch6/csv/game_plays.csv",
    header=True, inferSchema=True).drop(
    'secondaryType', 'periodType', 'dateTime', 'rink_side')
plays_pd = plays_df.filter("rand() < 0.003").toPandas()
plays_pd.shape

6.6.1 Feature Generation

We’ll use the same two-step process covered in Section 1.7 where 
we first one-hot encode the categorical features in the dataframe, 
and then apply deep feature synthesis to the data set. The code 
snippet below shows how to perform the encoding process using 
the Featuretools library. The output is a transformation of the 
initial dataframe that now has 20 dummy variables instead of the 
event and description variables. 


import featuretools as ft
from featuretools import Feature

es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays", dataframe=plays_pd,
    index="play_id", variable_types={
        "event": ft.variable_types.Categorical,
        "description": ft.variable_types.Categorical})

f1 = Feature(es["plays"]["event"])
f2 = Feature(es["plays"]["description"])

encoded, defs = ft.encode_features(plays_pd, [f1, f2], top_n=10)
encoded.reset_index(inplace=True)
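
To show what the encoding step does conceptually, the sketch below one-hot encodes the most frequent categories of a column in plain Python. It is a simplified illustration rather than the Featuretools implementation, and the function name one_hot_top_n and the sample events are made up.

```python
from collections import Counter


def one_hot_top_n(values, top_n=10, prefix="event"):
    """Encode the top_n most frequent categories as 0/1 columns."""
    top = [cat for cat, _ in Counter(values).most_common(top_n)]
    columns = ["{} = {}".format(prefix, cat) for cat in top]
    rows = [[1 if v == cat else 0 for cat in top] for v in values]
    return columns, rows


events = ["Shot", "Hit", "Shot", "Goal", "Shot", "Hit"]
columns, rows = one_hot_top_n(events, top_n=2)
print(columns)
print(rows)
```

Categories outside the top n, such as Goal in this toy example, are encoded as all zeros, which keeps the number of generated columns bounded for high-cardinality fields.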


The next step is using the dfs function to perform deep feature 
synthesis on our encoded dataframe. The input dataframe will have 
a record per play, while the output dataframe will have a single 
record per game after collapsing the detailed events into a wide
column representation using a variety of different aggregations. 


es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",
    dataframe=encoded, index="play_id")
es = es.normalize_entity(base_entity_id="plays",
    new_entity_id="games", index="game_id")
features, transform = ft.dfs(entityset=es,
    target_entity="games", max_depth=2)
features.reset_index(inplace=True)


Out[56]: Index(['game_id', 'SUMplaysdescriptionIcing', 'SUMplayseventPenalty',
       'SUMplayseventTakeaway', 'SUMplaysgoals_away',
       'SUMplaysdescriptionPeriodOfficial', 'SUMplayseventShot',
       'SUMplaysdescriptionPuckinBenches', 'SUMplaysdescriptionPuckinNetting',
       'SUMplaysdescriptionPeriodReady',
       ...
       'NUM_UNIQUEplaysst_x', 'NUM_UNIQUEplaysx', 'NUM_UNIQUEplaysst_y',
       'NUM_UNIQUEplaysteam_id_against', 'MODEplaysteam_id_for', 'MODEplaysy',
       'MODEplaysst_x', 'MODEplaysx', 'MODEplaysst_y',
       'MODEplaysteam_id_against'],
      dtype='object', length=188)

FIGURE 6.14: The schema for the generated features.


One new step that we need to perform, compared with the prior
approach, is determining what the schema will be
for the generated features, since this is needed as an input to the
Pandas UDF annotation. To figure out the schema
for the generated dataframe, we can create a Spark dataframe
and then retrieve the schema from the dataframe. Before converting
the Pandas dataframe, we need to modify the column names
in the generated dataframe to remove special characters, as shown
in the snippet below. The resulting Spark schema for the feature
application step is displayed in Figure 6.14.


features.columns = features.columns.str.replace("[(). =]", "", regex=True)
schema = sqlContext.createDataFrame(features).schema
features.columns
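
The character class in the replace call removes parentheses, periods, spaces, and equals signs. The plain-Python sketch below applies the same pattern with the standard re module to two feature names of the general shape that deep feature synthesis produces. The sample names and the helper clean_column are illustrative assumptions.

```python
import re


def clean_column(name):
    """Drop parentheses, periods, spaces, and equals signs from a feature name."""
    return re.sub(r"[(). =]", "", name)


print(clean_column("SUM(plays.description = Icing)"))
print(clean_column("NUM_UNIQUE(plays.st_x)"))
```

Underscores are left in place, so identifiers such as game_id pass through unchanged.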


We now have the required schema for defining a Pandas UDF.
Unlike the past UDFs we defined, the schema may change between
different runs based on the feature transformation aggregations
selected by Featuretools. In these steps, we also created a defs
object that defines the feature transformations to use for encoding
and a transform object that defines the transformations to perform
deep feature synthesis. Like the model object in the past section,
copies of these objects will be passed to the Pandas UDF executing
on worker nodes.

6.6.2 Feature Application 

To enable our approach to scale across a cluster of worker nodes,
we need to define a column to use for partitioning. Like the prior
section, we can bucket events into different sets of data to ensure
that the UDF process can scale. One difference from before is that
we need all of the plays from a specific game to be grouped into
the same partition. To achieve this result, we can partition by the
game_id rather than the player_id. An example of this approach
is shown in the code snippet below. Additionally, we can use the
hash function on the game ID to randomize the value, resulting in
more balanced bucket sizes.


# bucket IDs
plays_df.createOrReplaceTempView("plays_df")
plays_df = spark.sql("""
  select *, abs(hash(game_id))%1000 as partition_id
  from plays_df
""")

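
The key property of hash-based bucketing is that it is deterministic: every play from the same game lands in the same partition. The plain-Python sketch below illustrates this with zlib.crc32 as a stand-in hash, which is an assumption for illustration since Spark SQL uses its own hash function. The helper name bucket_for_game is hypothetical and the game IDs are sample values.

```python
import zlib


def bucket_for_game(game_id, num_buckets=1000):
    """Deterministically map a game ID to a bucket in [0, num_buckets)."""
    digest = zlib.crc32(str(game_id).encode("utf-8"))
    return digest % num_buckets


plays = [(2011030222, "Shot"), (2011030222, "Goal"), (2011030223, "Hit")]
for game_id, event in plays:
    print(game_id, event, bucket_for_game(game_id))
```

The CRC value is never negative, so no abs call is needed in this sketch. The Spark SQL hash function can return negative integers, which is why the query wraps it in abs before taking the remainder.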

We can now apply feature transformation to the full data set, using
the Pandas UDF defined below. The plays dataframe is partitioned
by the bucket before being passed to the generate features function.
This function uses the previously generated feature transformations
to ensure that the same transformation is applied across
all of the worker nodes. The input Pandas dataframe is a narrow
and deep representation of play data, while the returned dataframe
is a shallow and wide representation of game summaries.


from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def gen_features(plays_pd):

    es = ft.EntitySet(id="plays")
    es = es.entity_from_dataframe(entity_id="plays",
        dataframe=plays_pd, index="play_id", variable_types={
            "event": ft.variable_types.Categorical,
            "description": ft.variable_types.Categorical})
    encoded_features = ft.calculate_feature_matrix(defs, es)
    encoded_features.reset_index(inplace=True)

    # use the encoded features computed for this partition
    es = ft.EntitySet(id="plays")
    es = es.entity_from_dataframe(entity_id="plays",
        dataframe=encoded_features, index="play_id")
    es = es.normalize_entity(base_entity_id="plays",
        new_entity_id="games", index="game_id")
    generated = ft.calculate_feature_matrix(transform, es).fillna(0)

    generated.reset_index(inplace=True)
    generated.columns = generated.columns.str.replace(
        "[(). =]", "", regex=True)
    return generated

features_df = plays_df.groupby('partition_id').apply(gen_features)
display(features_df)


FIGURE 6.15: Generated features in the Spark dataframe.


The output of the display command is shown in Figure 6.15. We’ve
now worked through feature generation and deep learning in scalable
model pipelines. Now that we have a transformed data set, we
can join the result with additional features, such as the label that
we are looking to predict, and develop a complete model pipeline.


6.7 GCP Model Pipeline

A common workflow for batch model pipelines is reading input 
data from a lake, applying a machine learning model, and then 
writing the results to an application database. In GCP, BigQuery 
serves as the data lake and Cloud Datastore can serve as an application
database. We’ll build an end-to-end pipeline with these
components in the next chapter, but for now we’ll get hands on 
with a subset of the GCP components directly in Spark. 

While there is a Spark connector for BigQuery³, enabling large-scale
PySpark pipelines to be built using BigQuery directly, there
are some issues with this library that make it quite complicated
to set up for our Databricks environment. For example, we would
need to rebuild some of the JAR files and shade the dependencies.
One alternative is to use the Python BigQuery connector that we
explored in Section 5.1, but this approach is not distributed and
will eagerly pull the query results to the driver node as a Pandas
dataframe. For this chapter, we’ll explore a workflow where we
unload query results to Cloud Storage, and then read in the data
set from GCS as the initial step in the pipeline. Similarly, for model
output we’ll save the results to GCS, where the output is available
for pushing to Cloud Datastore. To productize this type of model
workflow, Airflow could be used to chain these different actions
together.

6.7.1 BigQuery Export 

The first step we’ll perform is exporting the results of a BigQuery
query to GCS, which can be performed manually using the BigQuery
UI. This is possible to perform directly in Spark, but as I
mentioned the setup is quite involved to configure with the current
version of the connector library. We’ll use the natality data set for
this pipeline, which lists attributes about child deliveries, such as
birth weight.


³ https://github.com/spotify/spark-bigquery/

create table dsp_demo.natality as (
  select *
  from `bigquery-public-data.samples.natality`
  order by rand()
  limit 10000
)

To create a data set, we’ll sample 10k records from the natality
public data set in BigQuery. To export this result set to GCS, we
need to create a table on BigQuery with the data that we want
to export. The SQL for creating this data sample is shown in the
snippet above. To export this data to GCS, perform the following
steps:

1. Browse to the GCP Console 

2. Search for “BigQuery” 

3. Paste the Query from the snippet above into the editor 

4. Click Run 

5. In the left pane, select the table, “dsp_demo.natality” 

6. Click “Export”, and then “Export to GCS” 

7. Set the location, “/dsp_model_store/natality/avro”

8. Use “Avro” as export format 

9. Click “Export” 

After performing these steps, the sampled natality data will be
saved to GCS in Avro format. The confirmation dialog from exporting
the data set is shown in Figure 6.16. We now have the
data saved to GCS in a format that works well with Spark.

6.7.2 GCP Credentials 

FIGURE 6.16: Confirming the Avro export on GCS.

We now have a data set that we can use as input to a PySpark
pipeline, but we don’t yet have access to the bucket on GCS from
our Spark environment. With AWS, we were able to set up programmatic
access to S3 using an access and secret key. With GCP,
the process is a bit more complicated because we need to move the
json credentials file to the driver node of the cluster in order to
read and write files on GCS. One of the challenges with using
Spark is that you may not have SSH access to the driver node,
which means that we’ll need to use persistent storage to move the
file to the driver machine. This isn’t recommended for production
environments, but instead is being shown as a proof of concept.
The best practice for managing credentials in a production
environment is to use IAM roles.

aws s3 cp dsdemo.json s3://dsp-ch6/secrets/dsdemo.json
aws s3 ls s3://dsp-ch6/secrets/

To move the json file to the driver node, we can first copy the
credentials file to S3, as shown in the snippet above. Now we can
switch back to Databricks and author the model pipeline. To copy
the file to the driver node, we can use the sc Spark context to
read the file line by line. This is different from
all of our prior operations where we have read in data sets as
dataframes. After reading the file, we then create a file on the
driver node using the Python open and write functions. Again, this
is an unusual action to perform in Spark, because you typically
want to write to persistent storage rather than local storage. The
result of performing these steps is that the credentials file will now
be available locally on the driver node in the cluster.


creds_file = '/databricks/creds.json'
creds = sc.textFile('s3://dsp-ch6/secrets/dsdemo.json')

with open(creds_file, 'w') as file:
    for line in creds.take(100):
        file.write(line + "\n")


Now that we have the json credentials file moved to the driver local
storage, we can set up the Hadoop configuration needed to access
data on GCS. The code snippet below shows how to configure
the project ID, file system implementation, and credentials file
location. After running these commands, we now have access to
read and write files on GCS.


sc._jsc.hadoopConfiguration().set("fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id",
    "your_project_id")
sc._jsc.hadoopConfiguration().set(
    "mapred.bq.auth.service.account.json.keyfile", creds_file)
sc._jsc.hadoopConfiguration().set(
    "fs.gs.auth.service.account.json.keyfile", creds_file)


6.7.3 Model Pipeline 

To read in the natality data set, we can use the read function with 
the Avro setting to fetch the data set. Since we are using the Avro 
format, the dataframe will be lazily loaded and the data is not 
retrieved until the display command is used to sample the data 
set, as shown in the snippet below. 

natality_path = "gs://dsp_model_store/natality/avro"
natality_df = spark.read.format("avro").load(natality_path)
display(natality_df)


Before we can use MLlib to build a regression model, we need to
perform a few transformations on the data set to select a subset
of the features, cast data types, and split records into training and
test groups. We’ll also use the fillna function as shown below in
order to replace any null values in the dataframe with zeros. For
this modeling exercise, we’ll build a regression model that predicts
the birth weight of a baby using a few different features including
the marriage status of the mother and parent ages. The prepared
dataframe is shown in Figure 6.17.


natality_df.createOrReplaceTempView("natality_df")

natality_df = spark.sql("""
SELECT year, plurality, apgar_5min,
       mother_age, father_age,
       gestation_weeks, ever_born
       ,case when mother_married = true
             then 1 else 0 end as mother_married
       ,weight_pounds as weight
       ,case when rand() < 0.5 then 1 else 0 end as test
from natality_df
""").fillna(0)

trainDF = natality_df.filter("test == 0")
testDF = natality_df.filter("test == 1")
display(natality_df)


FIGURE 6.17: The prepared Natality dataframe.


Next, we’ll translate our dataframe into the vector data types that
MLlib requires as input. The process for transforming the natality
data set is shown in the snippet below. After executing the transform
function, we now have training and test data sets we can use
as input to a regression model. The label we are building a model
to predict is the weight column.


from pyspark.ml.feature import VectorAssembler

# create a vector representation
assembler = VectorAssembler(inputCols=trainDF.schema.names[0:8],
                            outputCol="features")

trainVec = assembler.transform(trainDF).select('weight', 'features')
testVec = assembler.transform(testDF).select('weight', 'features')


MLlib provides a set of utilities for performing cross validation
and hyperparameter tuning in a model workflow. The code snippet
below shows how to perform this process for a random forest
regression model. Instead of calling fit directly on the model object,
we wrap the model object with a cross validator object that
explores different parameter settings, such as tree depth and number
of trees. This workflow is similar to the grid search functions in
sklearn. After searching through the parameter space, and using
cross validation based on the number of folds, the random forest
model is retrained on the complete training data set before being
applied to make predictions on the test data set. The result is a
dataframe with the actual weight and predicted birth weight.


from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

folds = 3
rf_trees = [50, 100]
rf_depth = [4, 5]

rf = RandomForestRegressor(featuresCol='features', labelCol='weight')

paramGrid = (ParamGridBuilder().addGrid(rf.numTrees, rf_trees)
             .addGrid(rf.maxDepth, rf_depth).build())
crossval = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol='weight'),
                          numFolds=folds)
rfModel = crossval.fit(trainVec)

predsDF = rfModel.transform(testVec).select("weight", "prediction")
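
It can help to estimate how much work a grid search involves before launching it on a cluster. The plain-Python sketch below enumerates the parameter combinations for the settings used above and counts the model fits, assuming one fit per combination per fold plus a final refit of the best setting on the full training set. The variable names grid, cv_fits, and total_fits are introduced here for illustration.

```python
from itertools import product

folds = 3
rf_trees = [50, 100]
rf_depth = [4, 5]

# every combination of the two hyperparameter lists
grid = [{"numTrees": t, "maxDepth": d} for t, d in product(rf_trees, rf_depth)]

# one fit per combination per fold, plus a final refit of the best setting
cv_fits = len(grid) * folds
total_fits = cv_fits + 1

print(grid)
print(cv_fits, total_fits)
```

With two values for each parameter and three folds, this comes to 13 random forest fits, and the count grows multiplicatively as more parameter values are added to the grid.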


In the final step of our GCP model pipeline, we’ll save the results 
to GCS, so that other applications or processes in a workflow can 
make use of the predictions. The code snippet below shows how 
to write the dataframe to GCS in Avro format. To ensure that 
different runs of the pipeline do not overwrite past predictions, we 
append a timestamp to the export path. 


import time

out_path = "gs://dsp_model_store/natality/preds-{time}/".format(
    time=int(time.time()*1000))
predsDF.write.mode('overwrite').format("avro").save(out_path)
print(out_path)


Using GCP components with PySpark took a bit of effort to configure,
but in this case we are running Spark in a different cloud
provider than where we are reading and writing data. In a production
environment, you’ll most likely be running Spark in the
same cloud as where you are working with data sets, which means
that you can leverage IAM roles for properly managing access to
different services.


6.8 Productizing PySpark 

Once you’ve tested a batch model pipeline in a notebook
environment, there are a few different ways of scheduling the
pipeline to run on a regular schedule. For example, you may want a
churn prediction model for a mobile game to run every morning and
publish the scores to an application database. Similar to the
workflow tools we covered in Chapter 5, a PySpark pipeline should
have monitoring in place for any failures that may occur. There are
a few different approaches for scheduling PySpark jobs to run:

• Workflow Tools: Airflow, Azkaban, and Luigi all support
running Spark jobs as part of a workflow.

• Cloud Tools: EMR on AWS and Dataproc on GCP support 
scheduled Spark jobs. 

• Vendor Tools: Databricks supports setting up job schedules
with monitoring through the web UI.

• Spark Submit: If you already have a cluster provisioned, you 
can issue spark-submit commands using a tool such as crontab. 

Vendor and cloud tools are typically easier to get up and running, 
because they provide options for provisioning clusters as part of 
the workflow. For example, with Databricks you can define the 
type of cluster to spin up for running a notebook on a schedule. 
When using a workflow tool, such as Airflow, you’ll need to add 
additional steps to your workflow in order to spin up and terminate
clusters. Most workflow tools provide connectors to EMR for
managing clusters as part of a workflow. The Spark submit option
is useful when first getting started with scheduling Spark jobs, but 
it doesn’t support managing clusters as part of a workflow. 
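To make the Spark submit option concrete, the sketch below assembles a spark-submit command and the crontab line that would run it every morning. The script path, cluster URL, and the --run-date application argument are hypothetical placeholders; only the --master flag is a standard spark-submit option.

```python
import shlex

def build_spark_submit(script, master, app_args=()):
    """Assemble a spark-submit command line as a list of arguments."""
    return ["spark-submit", "--master", master, script, *app_args]

# Hypothetical script path and cluster URL, for illustration only.
cmd = build_spark_submit("/opt/jobs/churn_pipeline.py",
                         "spark://spark-master:7077",
                         app_args=["--run-date", "2019-12-14"])

# A crontab entry that would run the job every morning at 6:00.
cron_line = "0 6 * * * " + " ".join(shlex.quote(part) for part in cmd)
print(cron_line)
```

This approach assumes the cluster is already running, and cron itself provides no alerting, so failures need to be reported by the job or by a separate check.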

Spark jobs can run on ephemeral or persistent clusters. An
ephemeral cluster is a Spark cluster that is provisioned to
perform a set of tasks and then terminated, such as running a churn
model pipeline. A persistent cluster is a long-running cluster that
may support interactive notebooks, such as the Databricks cluster
we set up at the start of this chapter. Persistent clusters are useful
for development, but can be expensive if the hardware spun up for
the cluster is underutilized. Some vendors support autoscaling
of clusters to reduce the cost of long-running persistent clusters.
Ephemeral clusters are useful, because spinning up a new cluster
to perform a task enables isolation of failure across tasks, and it
means that different model pipelines can use different library
versions and Spark runtimes.

In addition to setting up tools for scheduling jobs and alerting 
on job failures, it’s useful to set up additional data and model 
quality checks for Spark model pipelines. For example, I’ve set up 
Spark jobs that perform audit tasks, such as making sure that 
an application database has predictions for the current day, and 
trigger alerts if prediction data is stale. It’s also a good practice to 
log metrics, such as the ROC of a cross-validated model, as part 
of a Spark pipeline. 
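The staleness audit described above comes down to a date comparison. Here is a minimal sketch, assuming the date of the newest prediction has already been fetched from the application database; the function name and the max_age_days parameter are illustrative.

```python
from datetime import date, timedelta

def is_stale(latest_prediction_day, today, max_age_days=0):
    """True when the newest predictions are older than the allowed age."""
    return (today - latest_prediction_day) > timedelta(days=max_age_days)

today = date(2019, 12, 14)
print(is_stale(date(2019, 12, 14), today))  # predictions exist for today
print(is_stale(date(2019, 12, 12), today))  # predictions are two days old
```

In a scheduled job, a True result would be the condition that triggers an alert.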


6.9 Conclusion 

PySpark is a powerful tool for data scientists to build scalable
analyses and model pipelines. It is a highly desirable skill set for
companies, because it enables data science teams to own more of
the process of building and owning data products. There’s a variety
of ways to set up an environment for PySpark, and in this chapter
we explored a free notebook environment from one of the popular
Spark vendors.

This chapter focused on batch model pipelines, where the goal is
to create a set of predictions for a large number of users on a
regular schedule. We explored pipelines for both AWS and GCP
deployments, where the data sources and data outputs are data
lakes. One of the issues with these types of pipelines is that
predictions may be quite stale by the time that a prediction is
used. In Chapter 8, we’ll explore streaming pipelines for PySpark,
where the latency of model predictions is minimized.

PySpark is a highly expressive language for authoring model
pipelines, because it supports all Python functionality, but does
require some workarounds to get code to execute across a cluster
of worker nodes. In the next chapter we’ll explore Dataflow, a
runtime for the Apache Beam library, which also enables large-scale
distributed Python pipelines, but is more constrained in the types
of operations that you can perform.


Cloud Dataflow for Batch Modeling


Dataflow is a tool for building data pipelines that can run locally,
or scale up to large clusters in a managed environment. While
Cloud Dataflow was initially incubated at Google as a GCP-specific
tool, it now builds upon the open-source Apache Beam library,
making it usable in other cloud environments. The tool provides
input connectors to different data sources, such as BigQuery and
files on Cloud Storage, operators for transforming and aggregating
data, and output connectors to systems such as Cloud Datastore
and BigQuery.

In this chapter, we’ll build a pipeline with Dataflow that reads in
data from BigQuery, applies a sklearn model to create predictions, 
and then writes the predictions to BigQuery and Cloud Datastore. 
We’ll start by running the pipeline locally on a subset of data and 
then scale up to a larger data set using GCP. 

Dataflow is designed to enable highly scalable data pipelines, such
as performing ETL work where you need to move data between
different systems in your cloud deployment. It’s also been extended
to work well for building ML pipelines, and there’s built-in support
for TensorFlow and other machine learning methods. The result is
that Dataflow enables data scientists to build large-scale pipelines
without needing the support of an engineering team to scale things
up for production.

The core component in Dataflow is a pipeline, which defines the
operations to perform as part of a workflow. A workflow in Dataflow
is a DAG that includes data sources, data sinks, and data
transformations. Here are some of the key components:

• Pipeline: Defines the set of operations to perform as part of a
Dataflow job.

• Collection: The interface between different stages in a workflow.
The input to any step in a workflow is a collection of objects and
the output is a new collection of objects.

• DoFn: An operation to perform on each element in a collection,
resulting in a new collection.

• Transform: An operation to perform on sets of elements in a
collection, such as an aggregation.

Dataflow works with multiple languages, but we’ll focus on the
Python implementation for this book. There are some caveats with
the Python version, because worker nodes may need to compile
libraries from source, but it does provide a good introduction to
the different components in Apache Beam. To create a workflow
with Beam, you use the pipe syntax in Python to chain different
steps together. The result is a DAG of operations to perform that
can be distributed across machines in a cluster.
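The pipe syntax relies on Python operator overloading. The toy classes below are not Beam's implementation; they are a plain-Python sketch, with hypothetical names Step and Collection, of how an expression such as collection | 'label' >> step can chain labeled operations.

```python
class Step:
    """A named operation applied to every element of a list."""
    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Called for: 'label' >> Step(...); attaches the label to the step.
        return Step(self.fn, label)

class Collection:
    """Holds elements and records which labeled steps produced them."""
    def __init__(self, items, history=()):
        self.items = list(items)
        self.history = list(history)

    def __or__(self, step):
        # Called for: collection | step; returns a new collection.
        new_items = [step.fn(item) for item in self.items]
        return Collection(new_items, self.history + [step.label])

lines = Collection(["to be", "or not"])
shouted = lines | 'upper' >> Step(str.upper)
counted = shouted | 'length' >> Step(len)
print(counted.items, counted.history)
```

Since >> binds more tightly than |, the label is attached to the step before the step is applied to the collection. Unlike this eager toy, a real Beam pipeline only records the steps and defers execution until the pipeline is run.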

The two ways of transforming data in a Dataflow pipeline are DoFn
and Transform steps. A DoFn step defines an operation to perform on
each object in a collection. For example, we’ll query the Natality
public data set and the resulting collection will contain dictionary
objects. We’ll define a DoFn operation that uses sklearn to create a
prediction for each of these dictionary objects and output a new
dictionary object. A Transform defines an operation to perform on
a set of objects, such as performing feature generation to
aggregate raw tracking events into user-level summaries. These types
of operations are typically used in combination with a partition
transform step to divide up a collection of objects into a
manageable size. We won’t explore this process in this book, but a
transform could be used to apply Featuretools to perform automated
feature engineering as part of a Dataflow pipeline.
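To illustrate the kind of set-level operation a Transform performs, the sketch below aggregates raw tracking events into user-level summaries with plain Python. The event records and the summarize function are hypothetical, and nothing here uses the Beam API.

```python
from collections import defaultdict

# Hypothetical raw tracking events.
events = [
    {"user": "a", "event": "login"},
    {"user": "b", "event": "login"},
    {"user": "a", "event": "purchase"},
]

def summarize(events):
    """Group events by user and count them, one summary per user."""
    counts = defaultdict(int)
    for event in events:
        counts[event["user"]] += 1
    return [{"user": user, "events": n} for user, n in sorted(counts.items())]

print(summarize(events))
```

The difference from a DoFn is that each output depends on several input elements, so the grouping has to happen before the summary can be produced.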

In this chapter we’ll get hands on with building Dataflow pipelines
that can run locally and in a fully-managed GCP cluster. We’ll
start by building a simple pipeline that works with text data, and
then build a pipeline that applies a sklearn model in a distributed
workflow.


7.1 Apache Beam

Apache Beam is an open-source library for building data processing
workflows using Java, Python, and Go. Beam workflows can be
executed across several execution engines including Spark, Dataflow,
and MapReduce. With Beam, you can test workflows locally using
the Direct Runner for execution, and then deploy the workflow
in GCP using the Dataflow Runner. Beam pipelines can be batch,
where a workflow is executed until it is completed, or streaming,
where the pipeline runs continuously and operations are performed
in near real-time as data is received. We’ll focus on batch pipelines
in this chapter and cover streaming pipelines in the next chapter.

The first thing we’ll need to do in order to get up and running 
is install the Apache Beam library. Run the commands shown be- 
low from the command line in order to install the library, set up 
credentials for GCP, and to run a test pipeline locally. The pip 
command includes the gcp annotation to specify that the Dataflow 
modules should also be installed. If the last step is successful, the
pipeline will output the word counts for Shakespeare’s King Lear. 

# install Apache Beam
pip install --user apache-beam[gcp]

# set up GCP credentials
export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json

# run the word count example
python3 -m apache_beam.examples.wordcount --output outputs


The example pipeline performs a number of different steps in order
to perform this counting logic. First, the pipeline reads in the play
as a collection of string objects, where each line from the play is a
string. Next, the pipeline splits each line into a collection of words,
which are then passed to map and group transforms that count
the occurrence of each word. The map and group operations are
built-in Beam transform operations. The last step is writing the
collection of word counts to the location given by the output
parameter.
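The split, map, and group steps can be mimicked in plain Python to show what each stage produces. This is a stand-in for the logic only, not the Beam example itself; the three sample lines, the regular expression, and the lowercasing are assumptions made for illustration.

```python
import re
from collections import Counter

lines = ["KING LEAR", "Attend the lords of France", "the king"]

# Split each line into words (one input line can yield many words).
words = [w.lower() for line in lines for w in re.findall(r"[A-Za-z']+", line)]

# Map each word to a (word, 1) pair, then group by word and sum the ones.
pairs = [(word, 1) for word in words]
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["king"], counts["the"], counts["lear"])
```

In Beam, the grouping step is what allows the work to be distributed, because pairs with the same word can be routed to the same worker and summed there.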

Instead of walking through the example code in detail, we’ll build
our own pipeline that more closely resembles the workflow of
building a batch model application pipeline. The listing below shows
the full code for building and running a pipeline that reads in the
play from Cloud Storage, appends a message to the end of every line
of text, and writes the results back to Cloud Storage. The complete
pipeline can be executed from within a Jupyter notebook, which
is a useful way of getting up and running with simple pipelines
when learning Dataflow.


import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText

# define a function for transforming the data
class AppendDoFn(beam.DoFn):
    def process(self, element):
        # process must return an iterable of outputs, so wrap the
        # string in a list to emit one element per input line
        return [element + " - Hello World!"]

# set up pipeline parameters
parser = argparse.ArgumentParser()
parser.add_argument('--input', dest='input',
    default='gs://dataflow-samples/shakespeare/kinglear.txt')
parser.add_argument('--output', dest='output',
    default='gs://dsp_model_store/shakespeare/kinglear.txt')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)

# define the pipeline steps
p = beam.Pipeline(options=pipeline_options)
lines = p | 'read' >> ReadFromText(known_args.input)
appended = lines | 'append' >> beam.ParDo(AppendDoFn())
appended | 'write' >> WriteToText(known_args.output)

# run the pipeline
result = p.run()
result.wait_until_finish()

The first step in this code is to load the necessary modules needed
in order to set up a Beam pipeline. We import IO methods for
reading and writing text files, and utilities for passing parameters
to the Beam pipeline. Next, we define a class that will perform
a DoFn operation on every element passed to the process function.
This class extends the beam.DoFn class, which provides an interface
for processing elements in a collection. The third step is setting up
parameters for the pipeline to use for execution. For this example,
we need to set up the input location for reading the text and output
location for writing the result.

Once we have set up the pipeline options, we can set up the DAG
that defines the sequence of actions to perform. For this example,
we’ll create a simple sequence where the input text is passed to
our append step and the output is passed to the text writer. A
visualization of this pipeline is shown in Figure 7.1. To construct
the DAG, we use pipe (|) commands to chain the different steps
together. Each step in the pipeline is a ParDo or Transform command
that defines the Beam operation to perform. In more complicated
workflows, an operation can have multiple outputs and multiple
inputs.

Once the pipeline is constructed, we can use the run function to
execute the pipeline. When running this example in Jupyter, the
Direct Runner will be used by Beam to execute the pipeline on
the local machine. The last command waits for the pipeline to
complete before proceeding.

With the Direct Runner, all of the global objects defined in the
Python file can be used in the DoFn classes, because the code is
running as a single process. When using a distributed runner, some
additional steps need to be performed to make sure that the class has
access to the modules needed to perform operations. We’ll cover
this issue in the next section, since the process function in this
example does not use any modules. In general, process functions
should only make use of modules defined in the __init__ function, the
passed-in elements, and any side inputs that are provided to the
model. We won’t cover side inputs in this chapter, which provide a
way of passing additional data to DoFn operations, and instead we’ll
load the model from Cloud Storage when the class is instantiated.

FIGURE 7.1: Running the managed pipeline on GCP.

FIGURE 7.2: The resulting file on Google Storage.

After running the pipeline, the appended text will be available on
Cloud Storage. You can validate that the pipeline was successful
by browsing to the bucket in the GCP console, as shown in Figure
7.2. The result is saved as a single file, but for larger outputs the
result will be split into multiple files, where the best practice is
to use Avro or Parquet formats. The Avro format works well with
Dataflow, because data gets streamed between different steps in
the pipeline, even when running in batch mode, and results can
be written to storage once they are ready on each worker machine.
Unlike Spark, where stages with dependencies are not executed
concurrently, steps in a Dataflow workflow with dependencies can
execute simultaneously.

While it’s possible to run pipelines from Jupyter, which is useful
for learning how pipelines work, it’s more common to use a
text editor or IDE to create Python files that are executed via
the command line. The first command below shows how to run
the example pipeline with the Direct Runner, and the second
command shows how to run the pipeline on Cloud Dataflow using the
runner parameter. We also need to include a staging location on
Cloud Storage for Dataflow to manage the job and specify our
GCP project name.

# run locally
python3 append.py

# run managed
python3 append.py \
    --runner DataflowRunner \
    --project your_project_name \
    --temp_location gs://dsp_model_store/tmp/


By default, Beam pipelines run as a batch process. To run in
streaming mode, you need to pass the streaming flag, which we’ll
cover in the next chapter. The result of running the workflow on
Dataflow is shown in Figure 7.1. You can view the progress of your
workflows on GCP by browsing to the console and navigating to
the Dataflow view, which will show a list of jobs.

We now have hands on experience with building a data pipeline
using Apache Beam, and have run the pipeline locally and in a
managed cloud environment. The next step is to use the process
function to apply an ML model to the passed in data set.


7.2 Batch Model Pipeline 

Cloud Dataflow provides a useful framework for scaling up sklearn
models to massive data sets. Instead of fitting all input data into
a dataframe, we can score each record individually in the process
function, and use Apache Beam to stream these outputs to a data
sink, such as BigQuery. As long as we have a way of distributing
our model across the worker nodes, we can use Dataflow to perform
distributed model application. This can be achieved by passing
model objects as side inputs to operators or by reading the model
from persistent storage such as Cloud Storage. In this section we’ll
first train a linear regression model using a Jupyter environment,
and then store the results to Cloud Storage so that we can run the
model on a large data set and save the predictions to BigQuery
and Cloud Datastore.

7.2.1 Model Training 

The modeling task that we’ll be performing is predicting the birth
weight of a child given a number of factors, using the Natality
public data set. To build a model with sklearn, we can sample
the data set before loading it into a Pandas dataframe and fitting
the model. The code snippet below shows how to sample the data
set from a Jupyter notebook and visualize a subset of records, as
shown in Figure 7.3.

   year  plurality  apgar_5min  mother_age  father_age  gestation_weeks  ever_born  mother_married     weight
0  2005        1.0         9.0          34          38               41         10               1   8.628893
1  2005        1.0         6.0          36          39               34          8               1   2.678616
2  2006        1.0         9.0          38          41               41          8               1  11.062796
3  2007        2.0         9.0          42          42               38          8               1   5.436599
4  2007        1.0         8.0          38          43               31          8               1   3.560466

FIGURE 7.3: The sampled Natality data set for training.



from google.cloud import bigquery
client = bigquery.Client()

sql = """
  SELECT year, plurality, apgar_5min,
    mother_age, father_age,
    gestation_weeks, ever_born
    ,case when mother_married = true
      then 1 else 0 end as mother_married
    ,weight_pounds as weight
  FROM `bigquery-public-data.samples.natality`
  order by rand()
  limit 10000
"""

natalityDF = client.query(sql).to_dataframe().fillna(0)
natalityDF.head()


Once we have the data to train on, we can use the LinearRegression
class in sklearn to fit a model. We’ll use the full dataframe for
fitting, because the holdout data is the rest of the data set that
was not sampled. Once trained, we can use pickle to serialize the
model and save it to disk. The last step is to move the model file
from local storage to cloud storage, as shown below. We now have
a model trained that can be used as part of a distributed model
application workflow.

from sklearn.linear_model import LinearRegression
import pickle
from google.cloud import storage

# fit and pickle a model
model = LinearRegression()
model.fit(natalityDF.iloc[:,1:8], natalityDF['weight'])
# close the file before uploading so the contents are flushed to disk
with open("natality.pkl", 'wb') as f:
    pickle.dump(model, f)

# Save to GCS
bucket = storage.Client().get_bucket('dsp_model_store')
blob = bucket.blob('natality/sklearn-linear')
blob.upload_from_filename('natality.pkl')
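The serialize-then-restore cycle can be checked locally before involving Cloud Storage. The sketch below uses a plain dictionary of coefficients as a stand-in for a fitted model, so it runs without sklearn; the predict helper and the coefficient values are made up for illustration.

```python
import pickle

# Stand-in for a fitted model: an intercept plus one coefficient per feature.
model = {"intercept": 1.0, "coef": [0.5, 0.25]}

def predict(model, row):
    return model["intercept"] + sum(c * x for c, x in zip(model["coef"], row))

payload = pickle.dumps(model)        # bytes, ready to write to a file or blob
restored = pickle.loads(payload)     # what a worker would do after download

print(predict(model, [2.0, 4.0]), predict(restored, [2.0, 4.0]))
```

The same round trip applies to a real estimator, with the caveat that the worker loading the pickle should run a library version compatible with the one used for training.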


7.2.2 BigQuery Publish 

We’ll start by building a Beam pipeline that reads in data from
BigQuery, applies a model, and then writes the results to BigQuery.
In the next section, we’ll add Cloud Datastore as an additional
data sink for the pipeline. This pipeline will be a bit more complex
than the prior example, because we need to use multiple Python
modules in the process function, which requires a bit more setup.

We’ll walk through different parts of the pipeline this time, to
provide additional details about each step. The first task is to
define the libraries needed to build and execute the pipeline. We are
also importing the json module, because we need this to create the
schema object that specifies the structure of the output BigQuery
table. Like the past section, we are still sampling the data set to
make sure our pipeline works before ramping up to the complete
data set. Once we’re confident in our pipeline, we can remove the
limit command and autoscale a cluster to complete the workload.


import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
import json

query = """
  SELECT year, plurality, apgar_5min,
    mother_age, father_age,
    gestation_weeks, ever_born
    ,case when mother_married = true
      then 1 else 0 end as mother_married
    ,weight_pounds as weight
    ,current_timestamp as time
    ,GENERATE_UUID() as guid
  FROM `bigquery-public-data.samples.natality`
  order by rand()
  limit 100
"""


Next, we’ll define a DoFn class that implements the process function
and applies the sklearn model to individual records in the Natality
data set. One of the changes from before is that we now have an
__init__ function, which we use to instantiate a set of fields. In
order to have references to the modules that we need to use in the
process function, we need to assign these as fields in the class,
otherwise the references will be undefined when running the function
on distributed worker nodes. For example, we use self._pd to refer to
the Pandas module instead of pd. For the model, we’ll use lazy
initialization to fetch the model from Cloud Storage once it’s needed.
While it’s possible to implement the setup function defined by the
DoFn interface to load the model, there are limitations on which
runners call this function.


class ApplyDoFn(beam.DoFn):

    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(
                'dsp_model_store')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())

        new_x = self._pd.DataFrame.from_dict(element,
            orient="index").transpose().fillna(0)
        weight = self._model.predict(new_x.iloc[:,1:8])[0]
        return [{'guid': element['guid'], 'weight': weight,
                 'time': str(element['time'])}]


Once the model object has been lazily loaded in the process
function, it can be used to apply the linear regression model to the
input record. In Dataflow, records retrieved from BigQuery are
returned as a collection of dictionary objects and our process
function is responsible for operating on each of these dictionaries
independently. We first convert the dictionary to a Pandas dataframe
and then pass it to the model to get a predicted weight. The process
function returns a list of dictionary objects, which describe the
results to write to BigQuery. A list is returned instead of a
dictionary, because a process function in Beam can return zero, one,
or multiple objects.
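Both ideas, loading the model once on first use and returning a list so that a record can produce zero or one outputs, can be exercised without Beam. The class below is a plain-Python stand-in rather than a real DoFn; LazyScorer, fake_loader, and the doubling "model" are hypothetical names used only for this sketch.

```python
class LazyScorer:
    """Plain-Python stand-in for a DoFn that loads its model on first use."""
    def __init__(self, load_model):
        self._load_model = load_model   # called at most once per instance
        self._model = None

    def process(self, element):
        if self._model is None:
            self._model = self._load_model()
        if element.get("x") is None:
            return []                     # zero outputs: skip bad records
        return [{"guid": element["guid"], "score": self._model(element["x"])}]

loads = []
def fake_loader():
    loads.append(1)
    return lambda x: 2 * x               # stand-in for a trained model

scorer = LazyScorer(fake_loader)
records = [{"guid": "a", "x": 1.5}, {"guid": "b", "x": None}, {"guid": "c", "x": 4}]
outputs = [out for record in records for out in scorer.process(record)]
print(outputs, len(loads))
```

The loader runs once even though three records are processed, which matters when loading means downloading a file from Cloud Storage on every worker.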

An example element object passed to the process function is shown in
the listing below. The object is a dictionary type, where the keys 
are the column names of the query record and the values are the 
record values. 




{'year': 2001, 'plurality': 1, 'apgar_5min': 99, 'mother_age': 33,
 'father_age': 40, 'gestation_weeks': 38, 'ever_born': 8,
 'mother_married': 1, 'weight': 6.8122838958,
 'time': '2019-12-14 23:51:42.560931 UTC',
 'guid': 'b281c5e8-85b2-4cbd-a2d8-e501ca816363'}

To save the predictions to BigQuery, we need to define a schema
that defines the structure of the predictions table. We can do this
using a utility function that converts a JSON description of the
table schema into the schema object required by the Beam BigQuery
writer. To simplify the process, we can create a Python dictionary
object and use the dumps command to generate JSON.


schema = parse_table_schema_from_json(json.dumps({'fields':
    [ { 'name': 'guid', 'type': 'STRING'},
      { 'name': 'weight', 'type': 'FLOAT64'},
      { 'name': 'time', 'type': 'STRING'} ]}))
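To see what the dumps call hands to the schema utility, the dictionary can be serialized on its own with the standard json module. This sketch stops at the JSON text and does not call any Beam function.

```python
import json

fields = {"fields": [
    {"name": "guid", "type": "STRING"},
    {"name": "weight", "type": "FLOAT64"},
    {"name": "time", "type": "STRING"},
]}

schema_json = json.dumps(fields)
print(schema_json)

# Round trip: the JSON text parses back into the same structure.
column_names = [f["name"] for f in json.loads(schema_json)["fields"]]
print(column_names)
```

Building the schema from a dictionary keeps the column list in one place, which is convenient if the process function later emits additional fields.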


The next step is to create the pipeline and define a DAG of Beam
operations. This time we are not providing input or output
arguments to the pipeline, and instead we are passing the input and
output destinations to the BigQuery operators. The pipeline has
three steps: read from BigQuery, apply the model, and write to
BigQuery. To read from BigQuery, we pass in the query and specify
that we are using Standard SQL. To apply the model, we use
our custom class for making predictions. To write the results, we
pass the schema and table name to the BigQuery writer, and specify
that a new table should be created if necessary and that data
should be appended to the table if data already exists.

# set up pipeline options
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)

# define the pipeline steps
p = beam.Pipeline(options=pipeline_options)
data = p | 'Read from BigQuery' >> beam.io.Read(
    beam.io.BigQuerySource(query=query, use_standard_sql=True))
scored = data | 'Apply Model' >> beam.ParDo(ApplyDoFn())
scored | 'Save to BigQuery' >> beam.io.Write(beam.io.BigQuerySink(
    'weight_preds', 'dsp_demo', schema=schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


Row  guid                                  weight              time
1    105f76ae-9c9c-466b-a61d-c911fcd0a449  7.4475855859655615  2019-12-14 23:47:13.529691 UTC
2    425a31f5-9d01-4ae2-9996-82c27ad56143  7.93343406915932    2019-12-14 23:47:13.529691 UTC
3    dbc71373-7cfd-48e2-8cd3-e6c702587e7e  7.378916852275836   2019-12-14 23:47:13.529691 UTC
4    a4b3bb7d-1d8c-420f-9a23-063d6ff184f7  8.18444827748126    2019-12-14 23:47:13.529691 UTC
5    d8119fe0-cd4c-41f2-b13c-23bdfca302a2  7.699117579767197   2019-12-14 23:47:13.529691 UTC
6    f6132482-5883-472c-90e0-dedb4b0bfb12  7.406936963326817   2019-12-14 23:47:13.529691 UTC

FIGURE 7.4: The Natality predictions table on BigQuery.


The last step in the script is running the pipeline. While it is
possible to run this complete code listing from Jupyter, the pipeline
will not be able to complete because the project parameter needs
to be passed as a command line argument to the pipeline.

# run the pipeline
result = p.run()
result.wait_until_finish()

Before running the pipeline on Dataflow, it’s a best practice to
run the pipeline locally with a subset of data. In order to run
the pipeline locally, it’s necessary to specify the GCP project as a
command line argument, as shown below. The project parameter
is needed to read and write data with BigQuery. After running
the pipeline, you can validate that the workflow was successful
by navigating to the BigQuery UI and checking for data in the
destination table, as shown in Figure 7.4.

To run the pipeline on Cloud Dataflow, we need to pass a
parameter that identifies the Dataflow Runner as the execution
engine. We also need to pass the project name and a staging location
on Cloud Storage. We now pass in a requirements file that identifies
the google-cloud-storage library as a dependency, and set a cluster
size limit using the max workers parameter. Once submitted, you
can view the progress of the job by navigating to the Dataflow UI
in the GCP console, as shown in Figure 7.5.

# running locally
python3 apply.py --project your_project_name

# running on GCP
echo $'google-cloud-storage==1.19.0' > reqs.txt
python3 apply.py \
    --runner DataflowRunner \
    --project your_project_name \
    --temp_location gs://dsp_model_store/tmp/ \
    --requirements_file reqs.txt \
    --max_num_workers 5

We can now remove the limit command from the query in the
pipeline and scale the workload to the full data set. When running
the full-scale pipeline, it’s useful to keep an eye on the job to
make sure that the cluster size does not scale beyond expectations.
Setting the maximum worker count helps avoid issues, but if you
forget to set this parameter then the cluster size can quickly scale
and result in a costly pipeline run.
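One way to guard against forgetting the worker limit is to check the command line arguments before the pipeline is constructed. The sketch below is an optional safeguard rather than part of Beam; it assumes the Python SDK flag name --max_num_workers, and the check_worker_cap function is a hypothetical helper.

```python
import argparse

def check_worker_cap(argv):
    """Raise if a Dataflow run is requested without a worker limit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--runner", default="DirectRunner")
    parser.add_argument("--max_num_workers", type=int, default=None)
    known, _ = parser.parse_known_args(argv)
    if known.runner == "DataflowRunner" and known.max_num_workers is None:
        raise ValueError("set --max_num_workers before running on Dataflow")
    return known

args = check_worker_cap(["--runner", "DataflowRunner", "--max_num_workers", "5"])
print(args.runner, args.max_num_workers)
```

Local runs pass the check without a limit, since the Direct Runner does not provision a cluster.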

One of the potential issues with using Python for Dataflow
pipelines is that it can take a while to initialize a cluster, because
each worker node will install the required libraries for the job from
source, which can take a significant amount of time for libraries
such as Pandas. To avoid lengthy startup delays, it’s helpful to
avoid including libraries in the requirements file that are already
included in the Dataflow SDK (see footnote 1). For example, Pandas
0.24.2 is included with SDK version 2.16.0, which is a recent enough
version for this pipeline.

FIGURE 7.5: Running the managed pipeline with autoscaling.

One of the useful aspects of Cloud Dataflow is that it is fully
managed, which means that it handles provisioning hardware, deals
with failures if any issues occur, and can autoscale to match
demand. Apache Beam is a great framework for data scientists,
because it enables using the same tool for local testing and Google
Cloud deployments.

7.2.3 Datastore Publish 

Publishing results to BigQuery is useful for ETLs and other
applications that are part of batch data pipelines. However, it
doesn’t work well for use cases where applications need to retrieve
a prediction for users with low latency. GCP provides two NoSQL
databases that provide a solution for this use case, where you need
to retrieve information for a specific user with minimal latency. In
this section we’ll explore Cloud Datastore, which provides some
querying capabilities for an application database.

1 https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies

FIGURE 7.6: Publishing to BigQuery and Datastore.

We’ll build upon our prior pipeline and add an additional step
that publishes to Datastore while also publishing the predictions to
BigQuery. The resulting DAG is shown in Figure 7.6. The approach
we’ll use will write each entity to Datastore as part of the process
function. This is much slower than writing all of the entities as a
single Beam operation, but the Beam operator that performs this
step still requires Python 2.

To update the pipeline, we’ll define a new DoFn class and add this
as the last step in the pipeline, as shown in the code snippet
below. The __init__ function loads the datastore module and makes it
referenceable as a field. The process function creates a key object
that is used to index the entity we want to store, which is similar
to a dictionary object. A Datastore entity is the base object used
to persist state in a Datastore database. We’ll use the guid as a
unique index for the entity and assign weight and time attributes
to the object. Once we’ve set the attributes, we use the put
command to persist the object to Cloud Datastore. This last step can
take some time, which is why it is better to return a collection of
entities and perform the put step as part of a batch operation, if
supported.
FIGURE 7.7: The resulting entities in Cloud Datastore.


class PublishDoFn(beam.DoFn):

    def __init__(self):
        # import inside the class so the module is available on workers
        from google.cloud import datastore
        self._ds = datastore

    def process(self, element):
        client = self._ds.Client()
        # the guid is used as the unique key for the entity
        key = client.key('natality-guid', element['guid'])
        entity = self._ds.Entity(key)
        entity['weight'] = element['weight']
        entity['time'] = element['time']
        client.put(entity)

scored | 'Create entities' >> beam.ParDo(PublishDoFn())
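The batching idea mentioned earlier, where entities are collected and written in groups rather than one at a time, can be sketched without any GCP dependencies. In the sketch below, `chunked` and `put_in_batches` are hypothetical helpers, and `put_multi` stands for any callable that accepts a list of entities, such as the `put_multi` method of the Datastore client. The batch size of 500 is an assumption based on the per-commit limit Datastore has documented, so check the current quota before relying on it.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def put_in_batches(entities, put_multi, batch_size=500):
    """Call put_multi once per batch and return the number of calls."""
    calls = 0
    for batch in chunked(entities, batch_size):
        put_multi(batch)
        calls += 1
    return calls


# Stand-in for client.put_multi so the sketch runs without credentials.
written = []
calls = put_in_batches(range(1200), written.extend, batch_size=500)
print(calls, len(written))
```

The trade-off is fewer round trips to the database in exchange for holding a batch of entities in memory before writing.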

We can now rerun the pipeline to publish the predictions to both
BigQuery and Datastore. To validate that the model ran
successfully, you can navigate to the Datastore UI in the GCP console
and inspect the entities, as shown in Figure 7.7.
from google.cloud import datastore

client = datastore.Client()
query = client.query(kind='natality-guid')

query_iter = query.fetch()
for entity in query_iter:
    print(entity)
    break


It’s also possible to fetch the predictions published to Datastore
using Python. The code snippet above shows how to fetch all of the
model predictions and print out the first entity retrieved. A
sample entity is shown below, which contains the guid as the unique
identifier and additional attributes for the weight and time of the
model prediction.


<Entity('natality-guid', '0046cdef-6a0f-4586-86ec-4b995cfc7c4e')
  {'weight': 7.9434742419056,
   'time': '2019-12-15 03:00:06.319496 UTC'}>

We now have a Dataflow pipeline that can scale to a large data
set and output predictions to analytical and application databases.
This means that other services can fetch these predictions to
personalize products. Batch pipelines are one of the most common
ways that I’ve seen ML models productized in the gaming
industry, and Dataflow provides a great framework for enabling data
scientists to own more of the model production process.


7.3 Conclusion 

Dataflow is a powerful data pipeline tool that enables data
scientists to rapidly prototype and deploy data processing workflows
that can apply machine learning algorithms. The framework
provides a few basic operations that can be chained together to define
complex graphs of workflows. One of the key features of Dataflow
is that it builds upon an open source library called Apache Beam
that enables the workflows to be portable to other cloud
environments.

In this chapter we built a batch model pipeline that fetched data
from BigQuery, applied a linear regression model, and then
persisted the predictions to BigQuery and Cloud Datastore. In the
next chapter we’ll explore a streaming version of this pipeline and
reuse portions of the current pipeline.


8 


Streaming Model Workflows


Many organizations are now using streaming platforms in order
to build real-time data pipelines that transform streams of data
and move data between different components in a cloud
environment. These platforms are typically distributed and provide
fault-tolerance for streaming data. In addition to connecting
different systems together, these tools also provide the ability to store
records and create event queues.

One of the most popular streaming platforms is Apache Kafka,
which is an open-source solution for providing message streaming
across public and private clouds. Kafka is a self-hosted solution that
requires provisioning and managing a cluster of machines in order
to scale. GCP provides a fully-managed streaming platform called
PubSub and AWS provides a managed solution called Kinesis. The
best option to use depends on your cloud platform, throughput and
latency requirements, and DevOps concerns.

With a streaming platform, you can pass data between different 
components in a cloud environment, as well as external systems. 
For example, many game companies are now using these platforms 
to collect gameplay events from mobile games, where an event is 
transmitted from the game client to a game server, and then passed 
to the data platform as a stream. The message producer in this 
case is the game server passing the message to the consumer, which 
is the data platform that transforms and stores the event.

The connection to data science in production is that streaming
platforms can be used to apply ML models as a transform step in
a streaming pipeline. For example, you can set up a Python process
that reads in messages from a topic, applies a sklearn model, and
outputs the prediction to a new topic. This process can be part of
a larger workflow that provides real-time ML predictions for users,
such as item recommendations in a mobile game. For the model
application step to scale to large volumes of messages, we’ll need to
use distributed systems such as Spark and Cloud Dataflow rather
than a single Python process.
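The read, apply, and publish pattern described above can be sketched with in-memory queues standing in for topics. Everything here is an assumption made for illustration: `run_once` and `toy_model` are hypothetical names, the event fields are invented, and a real deployment would replace the queues with a streaming platform client.

```python
import json
import queue


def toy_model(event):
    """Hypothetical scoring rule standing in for a trained model."""
    return 1.0 if event["sessions"] > 2 else 0.0


def run_once(source, sink, model_fn):
    """Read one message from source, score it, and publish to sink."""
    event = json.loads(source.get())
    event["pred"] = model_fn(event)
    sink.put(json.dumps(event))


events = queue.Queue()  # stand-in for the input topic
preds = queue.Queue()   # stand-in for the output topic

events.put(json.dumps({"user": "a", "sessions": 3}))
run_once(events, preds, toy_model)
print(preds.get())
```

Because the consumer only depends on the message format, the scoring function can be replaced without changing the producer or downstream consumers.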

While the model application step in a streaming model pipeline is
similar to setting up a Lambda or Cloud Function, which already
provides near real-time predictions, a key difference is the ease
of integrating with other components in the cloud platform. For
example, with Cloud Dataflow you can route the event to BigQuery
for storage as well as the model application step, which may push
the output to a new message consumer. Another benefit is that it
enables using distributed tools such as PySpark to handle requests
for model application, versus the endpoint based approaches that
service requests in isolation.

One of the benefits of using messaging systems in a cloud
platform is that it enables different tools and different programming
languages to communicate using standardized interfaces. We’ll
focus on Python and PySpark in this book, but Java, Go, and many
other languages are supported by these platforms. In this
chapter, we’ll first use Apache Kafka to pass messages between
different Python processes and then consume, transform, and produce
new messages using PySpark Streaming. Next, we’ll use PubSub
on GCP to provide near real-time model predictions using Cloud
Dataflow in streaming mode.


8.1 Spark Streaming 

Streaming data sets have been supported in Spark since version
0.7, but it was not until version 2.3 that Structured Streaming
gained a low-latency mode called continuous processing. With
structured streaming, continuous processing can be used to achieve
millisecond latencies when scaling to high-volume workloads. The
general flow with structured streaming is to read data from an
input stream, such as Kafka, apply a transformation using Spark
SQL, Dataframe APIs, or UDFs, and write the results to an output
stream. Spark Streaming also works with managed streaming
platforms including PubSub and Kinesis, and other frameworks in
the Apache ecosystem including Flume.

In this section, we’ll first set up a Kafka instance and then produce
and consume messages on the same machine using Kafka. Next,
we’ll show how to consume messages from Kafka using the
readStream function in PySpark, and then build a streaming pipeline
that applies a sklearn model.

8.1.1 Apache Kafka 

Kafka is an open-source streaming platform that was incubated
at LinkedIn. It is designed to handle real-time data streams that
are high throughput and low latency. It is written in Java and
Scala, but supports a range of programming languages for
producing and consuming streams through standardized APIs. The
platform can scale to large data sets by using horizontal scaling
and partitioning to distribute workloads across a cluster of servers
called brokers. While open-source Kafka is a self-hosted solution for
message streaming, some cloud providers now offer fully-managed
versions of Kafka, such as Amazon’s MSK offering.

To show how Kafka can be integrated into a streaming workflow,
we’ll use a single-node setup to get up and running. For a
production environment, you’ll want to set up a multi-node cluster for
redundancy and improved latency. Since the focus of this
chapter is model application, we won’t dig into the details of setting
up Kafka for high-availability, and instead recommend managed
solutions for small teams getting started. To install Kafka, it’s
useful to browse to the website 1 and find the most recent release. In
order to install Kafka, we first need to install Java, and then
download and extract the Kafka release. The steps needed to set up a
single-node Kafka instance on an EC2 machine are shown in the
snippet below. We’ll also install a library for working with Kafka
in Python called kafka-python.

1 https://kafka.apache.org/quickstart


sudo yum install -y java
pip install --user kafka-python
wget http://mirror.reverse.net/pub/apache/kafka/2.4.0/kafka_2.12-2.4.0.tgz
tar -xzf kafka_2.12-2.4.0.tgz
cd kafka_2.12-2.4.0
bin/zookeeper-server-start.sh config/zookeeper.properties

# new terminal
bin/kafka-server-start.sh config/server.properties

# new terminal
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --replication-factor 1 --partitions 1 --topic dsp

# output
[2019-12-18 10:50:25] INFO Log partition=dsp-0, dir=/tmp/kafka-logs
Completed load of log with 1 segments, log start offset 0 and
log end offset 0 in 56 ms (kafka.log.Log)


When setting up Kafka, we’ll need to spawn three separate
processes to run dependencies, start the Kafka service, and create a
new topic for publishing messages. The snippet above runs the
following processes:

• Zookeeper: An Apache project that provides configuration and
service discovery for distributed systems.

• Kafka: Launches the bootstrap service that enables setting up
Kafka topics and using streaming APIs.

• Topics: Creates a new topic called “dsp”.

The Zookeeper and Kafka tasks are long-running processes that will
continue to execute until terminated, while the Topics process will
shut down once the new Kafka topic is set up. The output at the
bottom of the snippet shows the output from running this
command, which will be displayed in the terminal running the Kafka
process. In this configuration, we are setting up a single partition for
the topic with no replication. We now have a single-node Kafka
cluster set up for testing message streaming.

The first API that we’ll explore is the Producer API, which enables 
processes to publish a message to a topic. To publish a message to 
our Kafka server, we create a producer object by passing in an IP 
address and a serialization function, which specifies how to encode
Python objects into strings that can be passed to the Kafka server. 
The Python snippet below shows how to create the producer and 
send a dictionary object as a message to the server, publishing 
the message to the dsp topic. The dict object contains hello and 
time keys. If we run this code, the message should be successfully 
transmitted to the server, but there will not yet be a consumer to 
process the message. 

from kafka import KafkaProducer
from json import dumps
import time

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8'))

data = {'hello': 'world', 'time': time.time()}
producer.send('dsp', data)


To set up a process for consuming the message, we’ll explore the
Consumer API, which is used to read in streams of data. The Python
snippet below shows how to create a consumer object that connects
to the Kafka server and subscribes to the dsp topic. The consumer
object returned is iterable and can be used in combination with a
for loop in order to process messages. In the example below, the
for loop will suspend execution until the next message arrives and
continue iterating until the process is terminated. The value
object will be a Python dictionary that we passed from the producer,
while the deserializer function defines how to transform strings to
Python objects. This approach works fine for small-scale streams,
but with a larger data volume we also want to distribute the
message processing logic, which we’ll demonstrate with PySpark in the
next section.

from kafka import KafkaConsumer
from json import loads

consumer = KafkaConsumer('dsp',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda x: loads(x.decode('utf-8')))

for x in consumer:
    print(x.value)
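The serializer and deserializer used by the producer and consumer are inverses of each other, which can be checked locally without a running broker. The sketch below reuses the same two lambda expressions; the names `serialize` and `deserialize` and the sample message are chosen for illustration.

```python
from json import dumps, loads

# Same functions passed as value_serializer and value_deserializer.
serialize = lambda x: dumps(x).encode("utf-8")
deserialize = lambda x: loads(x.decode("utf-8"))

message = {"hello": "world", "time": 1576696313.876075}
payload = serialize(message)

# The broker only ever sees bytes.
assert isinstance(payload, bytes)
# Decoding the payload recovers the original dictionary.
assert deserialize(payload) == message
```

Running a round-trip check like this before connecting to Kafka helps separate encoding problems from connectivity problems.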

Now that we have Python scripts for producing and consuming
messages, we can test message streaming with Kafka. First, run
the Consumer script in a Jupyter notebook, and then run the
Producer script in a separate notebook. After running the producer
cell multiple times, you should see output from the consumer cell
similar to the results shown below.

{'hello': 'world', 'time': 1576696313.876075}
{'hello': 'world', 'time': 1576696317.435035}
{'hello': 'world', 'time': 1576696318.219239}

We can now use Kafka to reliably pass messages between
different components in a cloud deployment. While this section used a
test configuration for spinning up a Kafka service, the APIs we
explored apply to production environments with much larger data
volumes. In the next section, we’ll explore the streams API which
is used to process streaming data, such as applying an ML model.

8.1.2 Sklearn Streaming 

To build an end-to-end streaming pipeline with Kafka, we’ll
leverage Spark streaming to process and transform data as it arrives.
The structured streaming enhancements introduced in Spark 2.3
enable working with dataframes and Spark SQL while abstracting
away many of the complexities of dealing with batching and
processing data sets. In this section we’ll set up a PySpark streaming
pipeline that fetches data from a Kafka topic, applies a sklearn
model, and writes the output to a new topic. The entire workflow
is a single DAG that continuously runs and processes messages
from a Kafka service.

In order to get Kafka to work with Databricks, we’ll need to edit
the Kafka configuration to work with external connections, since
Databricks runs on a separate VPC and potentially separate cloud
than the Kafka service. Also, we previously used the bootstrap
approach to refer to brokers using localhost as the IP. On AWS,
the Kafka startup script will use the internal IP to listen for
connections, and in order to enable connections from remote machines
we’ll need to update the configuration to use the external IP, as
shown below.


vi config/server.properties

advertised.listeners=PLAINTEXT://{external_ip}:9092


After making this configuration change, you’ll need to restart 
the Kafka process in order to receive inbound connections from 
Databricks. You’ll also need to enable inbound connections from 
remote machines, by modifying the security group, which is
covered in Section 1.4.1. Port 9092 needs to be open for the Spark
nodes that will be making connections to the Kafka service.

We’ll also set up a second topic, which is used to publish the results 
of the model application step. The PySpark workflow we will set 
up will consume messages from a topic, apply a sklearn model, and 
then write the results to a separate topic, called preds. One of the 
key benefits of this workflow is that you can swap out the pipeline 
that makes predictions without impacting other components in 
the system. This is similar to components in a cloud workflow
calling out to an endpoint for predictions, but instead of changing
the configuration of components calling endpoints to point to new
endpoints, we can seamlessly swap in new backend logic without
impacting other components in the workflow.


bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --replication-factor 1 --partitions 1 --topic preds

It’s a good practice to start with a basic workflow that simply
consumes messages before worrying about how to build out a
predictive modeling pipeline, especially when working with streaming
data. To make sure that we’ve correctly set up Kafka for remote
connections with Databricks, we can author a minimal script that
consumes messages from the stream and outputs the results, as
shown in the PySpark snippet below. Databricks will refresh the
output on a regular interval and show new data in the output table
as it arrives. Setting the startingOffsets value to earliest means
that we’ll load data starting from the earliest offsets available in
the topic. Removing this setting will mean that only new messages
are displayed in the table.

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "{external_ip}:9092")
    .option("subscribe", "dsp")
    .option("startingOffsets", "earliest").load())
display(df)


Getting Databricks to communicate with the Kafka service can
be one of the main challenges in getting this sample pipeline to
work, which is why I recommend starting with a minimal PySpark
script. It’s also useful to author simple UDFs that process the
value field of the received messages to ensure that the decoded
message in PySpark matches the encoded data from the Python
process. Once we can consume messages, we’ll use a UDF to apply
a sklearn model, where UDF refers to a Python function and not
a Pandas UDF. As a general practice, it’s good to add checkpoints
to a Spark workflow, and the snippet above is a good example for
checking if the data received matches the data transmitted.
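A decoding check of the kind described above can be written as a plain Python function and tested locally before it is wrapped in a UDF. This is a sketch under stated assumptions: `check_message` and `EXPECTED_KEYS` are hypothetical names, and the expected keys are the `hello` and `time` fields sent by the producer in the previous section.

```python
import json

# Keys sent by the test producer from the Kafka section.
EXPECTED_KEYS = {"hello", "time"}


def check_message(value):
    """Decode a message value and list any expected keys it lacks."""
    if isinstance(value, (bytes, bytearray)):
        value = value.decode("utf-8")
    record = json.loads(value)
    missing = sorted(EXPECTED_KEYS - set(record))
    return record, missing


record, missing = check_message(b'{"hello": "world"}')
print(record, missing)
```

An empty `missing` list indicates that the decoded message carries the same fields that the producer encoded.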
For the Spark streaming example, we’ll again use the Games data
set, which has ten attributes and a label column. In this workflow,
we’ll send the feature vector to the streaming pipeline as input,
and output an additional prediction column as the output. We’ll
also append a unique identifier, as shown in the Python snippet
below, in order to track the model applications in the pipeline.
The snippet below shows how to create a Python dict with the ten
attributes needed for the model, append a GUID to the dictionary,
and send the object to the streaming model topic.

from kafka import KafkaProducer
from json import dumps
import time
import uuid

producer = KafkaProducer(bootstrap_servers=['{external_ip}:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8'))

data = {'G1': 1, 'G2': 0, 'G3': 0, 'G4': 0, 'G5': 0,
        'G6': 0, 'G7': 0, 'G8': 0, 'G9': 0, 'G10': 0,
        'User_ID': str(uuid.uuid1())}
result = producer.send('dsp', data)
result.get()


To implement the streaming model pipeline, we’ll use PySpark
with a Python UDF to apply model predictions as new elements
arrive. A Python UDF operates on a single row, while a Pandas
UDF operates on a partition of rows. The code for this pipeline
is shown in the PySpark snippet below, which first trains a model
on the driver node, sets up a data source for a Kafka stream, defines
a UDF for applying an ML model, and then publishes the scores
to a new topic as a pipeline output.


from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import json
import pandas as pd
from sklearn.linear_model import LogisticRegression

# build a logistic regression model
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/"
    "master/Recommendations/games-expand.csv")
model = LogisticRegression()
model.fit(gamesDF.iloc[:,0:10], gamesDF['label'])

# define the UDF for scoring users
def score(row):
    d = json.loads(row)
    p = pd.DataFrame.from_dict(d, orient="index").transpose()
    # column 0 is the probability of the first class in model.classes_
    pred = model.predict_proba(p.iloc[:,0:10])[0][0]
    result = {'User_ID': d['User_ID'], 'pred': pred}
    return str(json.dumps(result))

# read from Kafka
df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "{external_ip}:9092")
    .option("subscribe", "dsp").load())

# select the value field and apply the UDF
df = df.selectExpr("CAST(value AS STRING)")
score_udf = udf(score, StringType())
df = df.select(score_udf("value").alias("value"))

# write results to Kafka
query = (df.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "{external_ip}:9092")
    .option("topic", "preds")
    .option("checkpointLocation", "/temp").start())


The script first trains a logistic regression model using data fetched
from GitHub. The model object is created on the driver node, but
is copied to the worker nodes when used by the UDF. The next
step is to define a UDF that we’ll apply to streaming records in the
pipeline. The Python UDF takes a string as input, converts the
string to a dictionary using the json library, and then converts the
dictionary into a Pandas dataframe. The dataframe is passed to
the model object and the UDF returns a string representation of a
dictionary object with User_ID and pred keys, where the prediction
value is the propensity of the user to purchase a specific game.
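Because the scoring UDF is an ordinary Python function, its string-in, string-out contract can be exercised outside of Spark. The sketch below is a simplified variant rather than the pipeline code: `StubModel` is a hypothetical stand-in that mimics the `predict_proba` interface with a toy rule, the model is passed in as an argument, and plain lists are used in place of a Pandas dataframe so that the example has no third-party dependencies.

```python
import json


class StubModel:
    """Hypothetical stand-in exposing predict_proba like a classifier."""

    def predict_proba(self, rows):
        # Return [P(class 0), P(class 1)] per row using a toy rule.
        out = []
        for row in rows:
            p1 = min(1.0, sum(row) / 10.0)
            out.append([1.0 - p1, p1])
        return out


# The ten feature names used by the Games data set messages.
FEATURES = ["G%d" % i for i in range(1, 11)]


def score(row, model):
    """Parse a JSON message, apply the model, and return JSON."""
    d = json.loads(row)
    features = [[d[name] for name in FEATURES]]
    pred = model.predict_proba(features)[0][0]
    return json.dumps({"User_ID": d["User_ID"], "pred": pred})


message = {name: 0 for name in FEATURES}
message.update({"G1": 1, "User_ID": "test-user"})
print(score(json.dumps(message), StubModel()))
```

Testing the function this way catches JSON and key errors before they surface as failed tasks in a streaming job.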

The next three steps in the pipeline define the PySpark streaming
workflow. The readStream call sets up the connection to the Kafka
broker and subscribes to the dsp topic. Next, a select statement
is used to cast the value column of streaming records to a string
before passing the value to the UDF, and then creating a new
dataframe using the result of the Python UDF. The last step writes
the output dataframe to the preds topic, using a local directory as
a checkpoint location for Kafka. These three steps run as part of a
continuous processing workflow, where the steps do not complete,
but instead suspend execution until new data arrives. The result
is a streaming DAG of operations that processes data as it arrives.

When running a streaming pipeline, Databricks will show details 
about the workflow below the cell, as shown in Figure 8.1. The 
green icon identifies that this is a streaming operation that will 
continue to execute until terminated. There are also charts that 
visualize data throughput and latency. For a production pipeline, 
it’s useful to run code using orchestration tools such as Airflow 
with the Databricks operator, but the notebook environment does 
provide a useful way to run and debug streaming pipelines. 

Now that we are streaming model predictions to a new topic, we’ll 
need to create a new consumer for these messages. The Python 
snippet below shows how to consume messages from the broker for the
new predictions topic. The only change from the prior consumer 
is the IP address and the deserializer function, which no longer 
applies an encoding before converting the string to a dictionary. 

FIGURE 8.1: Visualizing stream processing in Databricks.

from kafka import KafkaConsumer
from json import loads

consumer = KafkaConsumer('preds',
    bootstrap_servers=['{external_ip}:9092'],
    value_deserializer=lambda x: loads(x))

for x in consumer:
    print(x.value)
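The deserializer above passes the raw message bytes straight to `loads` without an explicit decode step. A short check shows why this works: in Python 3.6 and later, `json.loads` accepts `bytes` input directly. The payload below is made up for the example.

```python
from json import loads

# Illustrative payload in the shape written to the preds topic.
raw = b'{"User_ID": "abc", "pred": 0.5}'

# Parsing the bytes directly matches decoding to a string first.
assert loads(raw) == loads(raw.decode("utf-8"))
print(loads(raw))
```

Keeping the explicit `decode('utf-8')` call is also valid and makes the expected encoding visible to readers of the code.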

We now have everything in place in order to test out the streaming 
pipeline with Spark streaming and Kafka. First, run the PySpark 
pipeline in a Databricks cell. Next, run the consumer script in a 
Jupyter notebook. To complete the workflow, run the producer 
script in a separate Jupyter notebook to pass a message to the 
pipeline. The result should be a prediction dictionary printed to
the console of the consumer notebook, as shown below. 

{'User_ID': '4be94cd4-21e7-11ea-ae04-8c8590b3eee6',
 'pred': 0.9325488640736544}

We now have a PySpark streaming pipeline that applies model
predictions with near real-time performance. There’s additional
tuning that we can perform to get this latency to within one
millisecond, which is useful for a variety of model and web
applications. The benefit of using Spark to perform model application is
that we can scale the cluster to match demand and can swap in
new pipelines as needed to provide model updates. Spark
streaming was initially a bit tricky to get up and running, but the recent
enhancements have made it much easier to get working with model
application pipelines. In this pipeline we used a simple regression
model, but streaming workflows can also be used for deep learning
tasks, such as image classification 2.


8.2 Dataflow Streaming 

In Chapter 7 we explored using Cloud Dataflow to create a batch
model pipeline, and authored a DoFn to apply a sklearn model.
Dataflow can also be used to build streaming model pipelines, by
setting a configuration flag. One of the features of Dataflow is
that many components can be reused across batch and streaming
workflows, and in this section we’ll use the same model application
class that we defined in the last chapter.

When working with Dataflow in streaming mode, you can use a
combination of streaming and batch data sources and streaming
data sinks when defining a DAG. For example, you can use the
approach from the prior section to read from a Kafka stream, apply
a model with a DoFn function, and write the results to another
stream as a data sink. Some of the GCP systems that work with
Dataflow, such as BigQuery, can be used as both a batch and
streaming data sink.

In this section, we’ll build a streaming pipeline with Dataflow that
streams in messages from PubSub, applies a sklearn model, and
publishes the results to Cloud Datastore. This type of pipeline is
useful for updating a user profile based on real-time data, such as
predicting if a user is likely to make a purchase.

2 https://www.youtube.com/watch?v=xwQwKW-cerE

8.2.1 PubSub

PubSub is a fully-managed streaming platform available on GCP.
It provides similar functionality to Kafka for achieving high
throughput and low latency when handling large volumes of
messages, but reduces the amount of DevOps work needed to maintain
the pipeline. One of the benefits of PubSub is that the APIs map
well to common use cases for streaming Dataflow pipelines.

One of the differences from Kafka is that PubSub uses separate 
concepts for producer and consumer data sources. In Kafka, you 
can publish and subscribe to a topic directly, while in PubSub con- 
sumers subscribe to subscriptions rather than directly subscribing 
to topics. With PubSub, you first set up a topic and then create 
one or more subscriptions that listen on this topic. To create a 
topic with PubSub, perform the following steps: 

1. Browse to the PubSub UI in the GCP Console 

2. Click “Create Topic” 

3. Enter “natality” for the topic ID, and click “Create Topic” 

The result of performing these actions is that we now have a topic
called natality that we can use for publishing messages. Next, we’ll 
create a subscription that listens for messages on this topic by 
performing these steps: 

1. Click on Subscriptions in the navigation pane 

2. Click “Create Subscription” 

3. Assign a subscription ID “dsp” 

4. Select the “natality” topic 

5. Click “Create” 
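The relationship between a topic and its subscriptions can be illustrated with a toy in-memory model, in which every subscription receives its own copy of each published message. This is a conceptual sketch, not the PubSub API: the `Topic` class is hypothetical, and the second subscription named `audit` is invented to show the fan-out behavior.

```python
from collections import deque


class Topic:
    """Toy topic that copies each message to every subscription."""

    def __init__(self):
        self.subscriptions = {}

    def create_subscription(self, name):
        # Each subscription keeps its own backlog of messages.
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, data):
        for backlog in self.subscriptions.values():
            backlog.append(data)


natality = Topic()
dsp = natality.create_subscription("dsp")
audit = natality.create_subscription("audit")  # hypothetical

natality.publish(b"Hello World!")
print(dsp.popleft(), audit.popleft())
```

Publishers only need to know the topic, while each consumer reads from its own subscription at its own pace.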

We now have a topic and subscription set up for streaming
messages in a pipeline. Before setting up a Dataflow pipeline, we’ll first
create message consumers and producers in Python. The code
snippet below shows how to read messages from the subscription using
the Google Cloud library. We first create a subscriber client, then
set up the subscription and assign a callback function. Unlike the
Kafka approach, which returns an iterable object, PubSub uses
a callback pattern where you provide a function that is used to
process messages as they arrive. In this example we simply print
the data field in the message and then acknowledge that the
message has been received. The while loop at the bottom of the code
block is used to keep the script running, because none of the other
commands suspend when executed.


import time
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "your_project_name", "dsp")

def callback(message):
    print(message.data)
    message.ack()

subscriber.subscribe(subscription_path, callback=callback)

while True:
    time.sleep(10)
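The difference between the iterable pattern used with Kafka and the callback pattern used with PubSub can be shown side by side with in-memory stand-ins. Both `consume_by_iteration` and `CallbackSubscriber` are hypothetical names created for this sketch, and the `deliver` method simulates the background delivery that the real client library performs.

```python
def consume_by_iteration(messages):
    """Kafka style: the caller pulls messages from an iterable."""
    return [m for m in messages]


class CallbackSubscriber:
    """PubSub style: the library pushes each message to a callback."""

    def __init__(self):
        self._callback = None

    def subscribe(self, callback):
        self._callback = callback

    def deliver(self, message):
        # Simulates the client library invoking the callback.
        self._callback(message)


received = []
sub = CallbackSubscriber()
sub.subscribe(received.append)
for m in [b"a", b"b"]:
    sub.deliver(m)

# Both patterns see the same messages in the same order.
assert received == consume_by_iteration([b"a", b"b"])
```

With the callback pattern the subscribe call returns immediately, which is why the script needs a loop to stay alive while messages are delivered.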


We’ll use the same library to create a message producer in Python.
The code snippet below shows how to use the Google Cloud library
to create a publishing client, set up a connection to the topic, and
publish a message to the natality topic. For the producer, we need to
encode the message in utf-8 format before publishing the message.


from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your_project_name", "natality")

data = "Hello World!".encode('utf-8')
publisher.publish(topic_path, data=data)
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To test out the pipcline, first run the consumer in a Jupyter note- 
book and then run the producer in a separate Jupyter notebook. 
The resuit should be that the consumer cell outputs "Helio World" 
to the console after receiving a message. Now that we have tested 
out basic functionality with PubSub, we can now integrate tliis 
messaging platform into a streaming Dataflow pipeline. 
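The encoding step in the producer can be checked on its own, without publishing anything. The sketch below uses only built-in string methods to show what the producer sends and what a consumer has to do to get text back; the accented sample string is an illustrative assumption.

```python
message = "Hello World!"

# str -> bytes: the form the producer passes as the message data
payload = message.encode("utf-8")

# bytes -> str: what a consumer does to recover the original text
decoded = payload.decode("utf-8")
print(payload, decoded)

# non-ASCII text also survives the round trip, at more than one byte
# per character for the accented letters
accented = "naïve café"
accented_bytes = accented.encode("utf-8")
roundtrip = accented_bytes.decode("utf-8")
```

Because the consumer callback prints the raw data field, a notebook cell will typically display the bytes form of the message rather than a plain string unless the payload is decoded first.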

8.2.2 Natality Streaming 

PubSub can be used to provide data sources and data sinks within
a Dataflow pipeline, where a consumer is a data source and a
publisher is a data sink. We’ll reuse the Natality data set to create
a pipeline with Dataflow, but for the streaming version we’ll use a
PubSub consumer as the input data source rather than a BigQuery
result set. For the output, we’ll publish predictions to Datastore
and reuse the publish DoFn from the previous chapter.


import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
import json

class ApplyDoFn(beam.DoFn):

    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        import json as js
        self._storage = storage
        self._pkl = pkl
        self._pd = pd
        self._json = js

    def process(self, element):
        # load the model from Cloud Storage on first use
        if self._model is None:
            bucket = self._storage.Client().get_bucket(
                'dsp_model_store')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())

        # PubSub messages arrive as bytes, so parse them into a dictionary
        element = self._json.loads(element.decode('utf-8'))
        new_x = self._pd.DataFrame.from_dict(element,
            orient="index").transpose().fillna(0)
        weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'], 'weight': weight,
                 'time': str(element['time'])}]

The code snippet above shows the function we’ll use to perform
model application in the streaming pipeline. This function is the
same as the function we defined in Chapter 7 with one modification:
the json.loads function is used to convert the passed in string into
a dictionary object. In the previous pipeline, the elements passed
in from the BigQuery result set were already dictionary objects,
while the elements passed in from the PubSub consumer are string
objects. We’ll also reuse the DoFn function from the previous chapter
which publishes elements to Datastore, listed in the snippet below.


class PublishDoFn(beam.DoFn):

    def __init__(self):
        from google.cloud import datastore
        self._ds = datastore

    def process(self, element):
        client = self._ds.Client()
        key = client.key('natality-guid', element['guid'])
        entity = self._ds.Entity(key)
        entity['weight'] = element['weight']
        entity['time'] = element['time']
        client.put(entity)
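Returning to the model application function, the decoding step and the column slice can be traced with the standard library alone. The sketch below is an illustration under stated assumptions: the payload values are made up, and plain dictionary key order stands in for the column order of the one-row data frame that the DoFn builds.

```python
import json

# hypothetical payload, shaped like one record of the natality result set
payload = json.dumps({
    "year": 2001, "plurality": 1, "apgar_5min": 99, "mother_age": 33,
    "father_age": 40, "gestation_weeks": 38, "ever_born": 8,
    "mother_married": 1, "weight": 6.81, "time": "0", "guid": "example-guid",
}).encode("utf-8")

# the streaming source hands over raw bytes, so decode and parse first
element = json.loads(payload.decode("utf-8"))

# json.loads keeps key order, so positions 1 through 7 are the fields
# that a [1:8] column slice selects as model inputs
feature_names = list(element.keys())[1:8]
feature_values = [element[name] for name in feature_names]
print(feature_names)
```

One consequence of slicing by position is that the model inputs depend on the order of the keys in the message, so a producer that sends the fields in a different order would silently change which values are scored.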


Now that we have defined the functions for model application
and publishing to Datastore, we can build a streaming DAG with
Dataflow. The Python snippet below shows how to build a Dataflow
pipeline that reads in a message stream from the natality topic,
applies the model application function, and then publishes
the output to the application database.

# set up pipeline parameters
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)

# define the topics
topic = "projects/{project}/topics/{topic}"
topic = topic.format(project="your_project_name", topic="natality")

# define the pipeline steps
p = beam.Pipeline(options=pipeline_options)
lines = p | 'Read PubSub' >> beam.io.ReadFromPubSub(topic=topic)
scored = lines | 'apply' >> beam.ParDo(ApplyDoFn())
scored | 'Create entities' >> beam.ParDo(PublishDoFn())

# run the pipeline
result = p.run()
result.wait_until_finish()

The code does not explicitly state that this is a streaming pipeline,
and the code above can be executed in a batch or streaming mode.
In order to run this pipeline as a streaming Dataflow deployment,
we need to specify the streaming flag as shown below. We can first
test the pipeline locally before deploying the pipeline to GCP. For
a streaming pipeline, it’s best to use GCP deployments, because
the fully-managed pipeline can scale to match demand, and the
platform will handle provisioning hardware and provide fault
tolerance. A visualization of the Dataflow pipeline running on GCP
is shown in Figure 8.2.

FIGURE 8.2: The Dataflow DAG with streaming metrics. The DAG has
three steps (Read PubSub, apply, Create entities), shown next to job
summary charts for system latency and data freshness in seconds.


python3 natality.py --streaming 
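The flag reaches the pipeline because the script parses its arguments with parse_known_args, which hands back any flags the parser does not define so they can be forwarded to PipelineOptions. The sketch below shows that pass-through with the standard library only; the argument list is passed explicitly here as an assumption, whereas the script reads it from the command line.

```python
import argparse

parser = argparse.ArgumentParser()

# flags the parser does not define are returned untouched as the second value
known_args, pipeline_args = parser.parse_known_args(
    ["--streaming", "--runner", "DataflowRunner"])

print(pipeline_args)
```

This keeps the script's own arguments separate from the runner's: any option added to the parser later would show up in known_args, while everything else continues to flow through to the pipeline options.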

To test out the pipeline, we’ll need to pass data to the natality topic,
which is forwarded to the dsp subscription. The code snippet
below shows how to pass a dictionary object to the topic using
Python and the Google Cloud library. The data passed to PubSub
represents a single record in the BigQuery result set from the
previous chapter.


import json
from google.cloud import pubsub_v1
import time

data = json.dumps({'year': 2001, 'plurality': 1,
    'apgar_5min': 99, 'mother_age': 33,
    'father_age': 40, 'gestation_weeks': 38, 'ever_born': 8,
    'mother_married': 1, 'weight': 6.8122838958,
    'time': str(time.time()),
    'guid': 'b281c5e8-85b2-4cbd-a2d8-e501ca816363'}
).encode('utf-8')

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your_project_name", "natality")
publisher.publish(topic_path, data=data)

FIGURE 8.3: The prediction output pushed to Cloud Datastore. The
entity of kind natality-guid has a time of 1576810954.518128 and a
weight of 7.700608304146377.

The result of passing data to the topic should be an updated entry
in Datastore, which provides a model prediction for the passed in
GUID. As more data is passed to the pipeline, additional entries
will be added to the data set. A sample output of this pipeline is
shown in Figure 8.3, which displays the predicted weight for one
of the records passed to the pipeline. To run the pipeline on GCP,
run the following statement on the command line.

python3 natality.py --streaming \
    --runner DataflowRunner \
    --project your_project_name \
    --temp_location gs://dsp_model_store/tmp/

We now have a Dataflow streaming pipeline running in a fully-managed
environment. We can use PubSub to interface the
pipeline with other components in a cloud deployment, such as
a data platform that receives real-time data from mobile applications.
With Dataflow, many components can be reused across
batch and streaming model pipelines, which makes it a flexible
tool for building production pipelines. One factor to consider when
using Dataflow for streaming pipelines is that costs can be much
larger when using streaming versus batch operations, such as writing
to BigQuery 3.


8.3 Conclusion 

Streaming model pipelines are useful for systems that need to apply
ML models in real-time. To build these types of pipelines, we
explored two message brokers that can scale to large volumes of
events and provide data sources and data sinks for these pipelines.
Streaming pipelines often constrain the types of operations you
can perform, due to latency requirements. For example, it would
be challenging to build a streaming pipeline that performs feature
generation on user data, because historic data would need to be
retrieved and combined with the streaming data while maintaining
low latency. There are patterns for achieving this type of result,
such as precomputing aggregates for a user and storing the data
in an application database, but it can be significantly more work
getting this type of pipeline to work in a streaming mode versus a
batch mode.
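The precomputed-aggregate pattern mentioned above can be sketched in a few lines. This is a minimal illustration and not a production design: the dictionary stands in for an application database, and the field names and the build_features helper are hypothetical.

```python
# hypothetical stand-in for an application database of per-user aggregates,
# refreshed ahead of time by a separate batch job
user_aggregates = {
    "user_1": {"sessions_7d": 12, "purchases_30d": 3},
}

# values used when a user has no precomputed history yet
DEFAULT_AGGREGATES = {"sessions_7d": 0, "purchases_30d": 0}

def build_features(event, store=user_aggregates):
    """Join one streaming event with aggregates computed ahead of time."""
    # a single key lookup replaces a scan over the user's event history
    history = store.get(event["user_id"], DEFAULT_AGGREGATES)
    return {"amount": event["amount"], **history}

features = build_features({"user_id": "user_1", "amount": 9.99})
new_user = build_features({"user_id": "user_2", "amount": 1.0})
print(features, new_user)
```

The trade-off is freshness: the streaming step stays fast because it only performs a lookup, but the aggregates are only as current as the last run of the job that writes them.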

We first explored Kafka as a streaming message platform and built
a real-time pipeline using Structured Streaming and PySpark. Next,
we built a streaming Dataflow pipeline reusing components from
the previous chapter that now interface with the PubSub streaming
service. Kafka is typically going to provide the best performance
in terms of latency between these two message brokers, but it takes
significantly more resources to maintain this type of infrastructure
versus using a managed solution. For small teams getting started,
PubSub or Kinesis provide great options for scaling to match demand
while reducing DevOps support.

3 https://labs.spotify.com/2017/10/16/

8.4 Thank You

Writing this book has been a great experience for getting hands-on 
with many of the tools that I advocate for data scientists to learn 
and add to their toolbox. Thank you for reading through this book 
to the end. 

Given the breadth of tools covered in this book, I wasn’t able to
provide much depth on any particular topic. The next step for readers
is to choose a topic introduced in this text and find resources
to learn about the topic in more depth than can be covered here.

Data science is a rapidly evolving field, and that means that the
contents of this book will become outdated as cloud platforms
evolve and libraries are updated. While I was authoring this text,
substantial updates were released for TensorFlow which impacted
the later chapters. The takeaway is that keeping up with data
science as a discipline requires ongoing learning and practice.